ABSTRACT


An ensemble prediction system (EPS) generates flow-dependent estimates of 
uncertainty (i.e., random error due to analysis and model errors) associated with a 
numerical weather prediction model to provide information critical to optimal decision 
making. Ambiguity, or uncertainty in the prediction of forecast uncertainty, arises due to 
EPS deficiencies, including finite sampling and inadequate representation of the sources 
of forecast uncertainty. An EPS based on a low-order dynamical system was used to 
investigate the behavior of ambiguity, validate two practical estimation methods against a 
theoretical (impractical) technique, and apply ambiguity in decision making. Ambiguity 
generally decreased with increasing lead time and was found to depend strongly on 
ensemble forecast variance and the variability of ensemble mean error. The practical 
estimation techniques provided reasonably accurate ambiguity estimates, although they 
were too low at early lead times. The theoretical ambiguity estimate added significant 
value when combining ambiguity with forecast uncertainty to provide a single normative 
decision input. Additionally, value added to secondary user criteria (e.g., minimizing
repeat false alarms) was explored using the practical estimation techniques. Repeat false alarms
were significantly reduced while maintaining primary value by using ambiguity 
information to selectively reverse normative decisions to take protective action, which 
effectively redistributed negative outcomes. 
Figure 1. Three simulated attempts to represent the forecast PDF using an
eight-member “perfect model” ensemble. The forecast PDF (solid) being sampled
is N(0,1), while the realized ensemble PDF (dashed) is normal with parameter
values calculated based on random ensemble members: (a) mean and variance
close to true values, (b) negatively biased mean and variance too small,
(c) mean close to true and variance too large. Vertical lines represent the
location of ensemble members.

Figure 2. Sampling distributions of the (a) standardized error in ensemble mean
and (b) fractional error in ensemble spread, dependent on the number of
ensemble members. Results are shown for ensemble sizes of 10, 20, 40 and 80
members (labeled) [From Eckel and Allen 2009].

Figure 3. Optimal value score across the range of C/L values. The value score
for each C/L is calculated using the C/L as the decision threshold. The
climatological rate of occurrence (o) is 29.5%.

Figure 4. Ambiguity distribution overlap in the C/L scenario. The hatched area
represents the overlap of the ambiguity distribution beyond the C/L (blue
line), which would result in a different decision than that found using the
best-guess or control forecast probability (red line).

Figure 5. Histogram of possible first- and second-order uncertainty associated
with some event used for calculating the uncertainty-folding forecast
probability estimate (). As an example, the bin of forecast probability values
44% < p^ < 45% (arrow) has a relative frequency of 5%, thus contributing
44.5% × 5% = 2.23% to the summation in Equation (6).

Figure 6. Lorenz 96 System schematic with 8 resolved variables (large circles)
and 256 unresolved variables (small circles). The unresolved variables are
grouped with the resolved variable to which they belong in sets of 32
[From Wilks 2005].

Figure 7. Scatterplot of the unresolved tendency U from all resolved variables
as a function of the resolved variable. The fourth-order polynomial regression
best-fit (solid line) is the deterministic portion of the parameterization.
The average variance of U across all X values about the best-fit line is used
for the stochastic portion of the parameterization.

Figure 8. Probability density of resolved (X_k) variable using (a) L96 System,
(b) L96 Model with deterministic parameterization, and (c) L96 Model with
stochastic parameterization.

Figure 9. Multi-model EPS deterministic parameterizations. The solid line is
the deterministic portion of the stochastic parameterization shown in Figure 7.
Dashed lines are static deterministic parameterizations, where each is
associated with a specific ensemble member. Only ten members are shown for
clarity.
Figure 10. Error variance diagram using L96M deterministic and ensemble
forecast data from 24,000 forecast-observation pairs.

Figure 11. Dispersion diagram using uncalibrated L96M EPS forecast data from
24,000 forecast-observation pairs.

Figure 12. Dispersion diagram using calibrated L96M EPS forecast data from
24,000 forecast-observation pairs.

Figure 13. Verification rank histograms using uncalibrated L96 EPS ensemble
forecast data from 24,000 forecast-observation pairs for various forecast lead
times. The solid red line indicates the uniform probability of any rank given
a 21-member ensemble. The dashed red lines are the bounds of the 95% CI about
the uniform probability given the number of ensemble forecasts (M).

Figure 14. Verification rank histograms using calibrated L96 EPS ensemble
forecast data from 24,000 forecast-observation pairs for various forecast lead
times. Same as Figure 13.

Figure 15. Comparison of Verification Outlier Percentage (VOP) values based on
the uncalibrated (solid) and calibrated (dot-dash) L96 EPS ensemble forecast
data from 24,000 forecast-observation pairs. The perfect VOP-line of 0.26% is
shown by the dotted line.

Figure 16. Brier skill score (BSS) for the common event using uncalibrated L96
EPS ensemble forecast data from 24,000 forecast-observation pairs. Error bars
created using bootstrap resampling represent the 95% CI about the BSS value at
each forecast lead time. The dashed line is the zero-skill line.

Figure 17. BSS for the common event using calibrated L96 EPS ensemble forecast
data from 24,000 forecast-observation pairs. Same as Figure 16.

Figure 18. Comparison of (a) reliability and (b) resolution components of BSS
for both uncalibrated (blue solid line) and calibrated (red dashed line) for
the common event.

Figure 19. BSS for the rare event using uncalibrated L96 EPS ensemble forecast
data from 24,000 forecast-observation pairs. Same as Figure 16.

Figure 20. BSS for the rare event using calibrated L96 EPS ensemble forecast
data from 24,000 forecast-observation pairs. Same as Figure 16.

Figure 21. Comparison of (a) reliability and (b) resolution components of BSS
for both uncalibrated (blue solid line) and calibrated (red dashed line) for
the rare event.

Figure 22. Uniform Ranks method. Calculating forecast probability for X > 5.0
using a 10-member ensemble. The probability value of 77% is represented by the
hatched area [After Szczes 2008].

Figure 23. Postprocessing steps for L96 EPS data.

Figure 24. L96M EPS EoE schematic. After the random starting state is
determined, this state is integrated forward through the data assimilation and
forecast periods using the L96S. The process inside the dashed box is repeated
N times using the L96M with the same random initial state to generate the EoE
constituents.
Figure 25. Example comparison of a true and an ensemble forecast PDF (a) and
CDF (b) defined as N(2.2°C, 2.6°C) and N(2.8°C, 1.8°C) respectively. An error
of −13.9% in p_e for the chance of temperature < 0°C is the difference in the
PDFs’ shaded areas, or the difference in the two CDFs (double arrow)
[From Eckel and Allen 2009].

Figure 26. (a) Error in p_e for a range of temperature values for the event
threshold, calculated as the difference in the two CDFs of Figure 25. The top
axis is the nonlinear p^ scale. (b) Plot of p_e vs. true forecast probability
(solid), where the dashed line indicates perfect correlation [From Eckel and
Allen 2009].

Figure 27. Histogram and fitted PDFs of results from an example bulk-calibrated
ensemble forecast dataset for (a) error in ensemble mean, (b) fractional error
in ensemble spread, and (c) ensemble spread. The data are based on statistics
from the JM 51-member EPS. The domain and forecast period are the same as
described in Chapter IIEP. [From Eckel and Allen 2009].

Figure 28. Scatter plots showing relationships between the variables in
Figure 27. Correlation coefficient (r) is inset in each plot [From Eckel and
Allen 2009].

Figure 29. Relationship of ensemble spread with variability (standard
deviation) of (a) ensemble mean error and (b) fractional error in ensemble
spread. Solid line in each plot indicates the standard deviation of the error
distributions in Figure 27 (a) and (b). [After Eckel and Allen 2009].

Figure 30. True forecast probability for five sets of random draws from the
PDFs in Figure 27 where each curve is labeled with its associated ensemble mean
error, ensemble spread error and ensemble spread. The five possible values of
true forecast probability (marked by dots) for a p_e of 55% are 79.1, 69.6,
52.4, 51.3, and 46.7% [After Eckel and Allen 2009].

Figure 31. Histogram of 50,000 sample values of true forecast probability for
calibrated ensemble forecast probability of (a) 55.0%, (b) 11.0%, and
(c) 94.0% generated from random samples from the PDFs in Figure 27. Each
histogram is centered on the p_e value from which it was generated since the
ensemble forecast PDFs were calibrated. The 5th and 95th percentile values of
true forecast probability (for use in Figure 32) are indicated by p_05 and
p_95 [From Eckel and Allen 2009].

Figure 32. CES ambiguity for all calibrated forecast probability values. After
repeated sampling, the 5th and the 95th percentiles of the possible true
forecast probability values (p_05 and p_95) represent ambiguity as a 90% CI
about the expected true value (dashed line) for calibrated p_e [After Eckel and
Allen 2009].

Figure 33. CES ambiguity for all calibrated forecast probability values using a
set ensemble spread. Similar to Figure 32 but for specific values of ensemble
spread rather than all possible values, but still based on the error
distributions in Figure 27 (a) and (b). The thin (thick) curves show the
ambiguity for an ensemble spread of 2.0°C (6.0°C) [From Eckel and Allen 2009].
Figure 34. Ambiguity distributions produced by bootstrap resampling of
simulated ensemble forecast data (not shown) for (a) an example, perfect
30-member forecast, simulated by 30 random draws from the true PDF in Figure 25
and (b) an example, perfect 80-member forecast simulated using the same true
PDF as in (a). The original forecast probability (p_e), p_05 and p_95 (5th and
95th percentiles that define total ambiguity), and p_T (true forecast
probability) are labeled. Total ambiguity values are 17.8% for (a) and 12.4%
for (b). Notice that p_e ends up as the distribution’s central value [After
Eckel and Allen 2009].

Figure 35. Error distributions of (a) mean error in the ensemble mean and
(b) fractional error in ensemble spread. The solid lines are the original,
uncalibrated error distributions for the JM 2-m 5-day temperature forecasts.
The dashed lines give the reduced error distributions, where the error variance
associated with finite sampling (for 51 members) has been removed. The reduced
error distributions are used to draw random calibration coefficients during
RCR.

Figure 36. Example RCR ambiguity distributions using (a) fixed, bulk
calibration on each resample and (b) random calibration on each resample for
the JM 5-day 2-m temperature forecast for a single grid point and date. Note
that the random calibration produces a wider ambiguity distribution [After
Eckel and Allen 2009].

Figure 37. Post-processing steps for ambiguity data for the three estimation
techniques.

Figure 38. Iterative-bisection method used to converge on the A-value giving
the expected value of EoE constituent or RCR resampled p^ values equal to some
desired p* value.

Figure 39. Integrated optimal VS (IOVS) example for the control forecasts at a
single forecast lead time. (a) The optimal VS is computed using the 800 control
forecast probability values at τ = 2.6. The positive area under the curve is
computed using Equation (27) by summing the area of intervals (gray regions)
from C/L 0-1 using a Δx of 0.01. (b) The Δy of each interval’s area is the
optimal VS at the center of the interval (e.g., for the interval 0.51-0.52, Δy
is the optimal VS at C/L = 0.515). An interval’s area is taken as zero if the
optimal VS ≤ 0.

Figure 40. Flowchart of decision process for the repeat false alarms secondary
criteria scenario using the ambiguity distribution overlap. Tallying indicates
filling in the contingency table (Table 2, page 31) for the current decision
rule (C/L). The setting of the repeat false alarm flag determines the outcome
of the “Previous forecast PA” decision point, where a set flag equals Y.

Figure 41. Overlap threshold conceptual model as a function of C/L for the
repeat false alarm secondary criteria value testing scenario.
Figure 42. Flowchart for determining empirical secondary criteria overlap
threshold value. Performed for each C/L, testing compares the metrics derived
using the control forecast probability versus using the current overlap
threshold.

Figure 43. Reliability diagrams for raw and calibrated NCEP GEFS forecasts
based on the training dataset with 102,060 forecast-observation pairs. The
reliability diagrams for the (a) raw and (c) calibrated data used 11 forecast
probability bins (0-0.05, 0.05-0.15, 0.15-0.25, ..., 0.95-1.0) where the
average forecast probability with each bin is used as the bin’s representative
value. Error bars represent the 95% binomial CI (Wilks, 2006). The dashed line
indicates perfect reliability, while the dotted line shows the sample
climatology. The bin usage histograms for the (b) raw and (d) calibrated data
give the number of forecast probabilities falling in each of the 11 bins.

Figure 44. Reliability diagram for raw and calibrated NCEP GEFS forecasts based
on the independent application dataset with 50,220 forecast-observation pairs.
Same as Figure 43.

Figure 45. Sample CES^ NCEP GEFS 21-member EPS ambiguity distributions created
using error statistics in Table 7. The histograms show the relative frequency
of p_T values for p* = 15% with σ = 2°C (gray) and σ = 8°C (transparent).

Figure 46. Empirical optimal overlap threshold for reducing repeat false alarms
for the event 2-m temperature < 0°C using the NCEP GEFS training dataset. The
optimal overlap threshold is computed at each C/L from 0.01-0.99 at an
increment of 0.01 (solid line).

Figure 47. Comparison of primary value metrics (a) optimal VS, (b) POD and
(c) POMD used to find the optimal overlap threshold for C/L 0.01. Control
scores in all three panels are shown by the solid line with error bars
representing the 95% CI. The expected value of metrics using overlap threshold
values from 0.5% to 50% at a 0.5% increment are shown by the dot-dashed line
with a circle at each overlap threshold value. Arrows indicate the first point
where expected value of each metric falls within the 95% CI of the control. The
optimal overlap threshold is the lowest threshold value where the expected
values of all three metrics fall within the 95% CI of the control. In this
case, the optimal overlap threshold is 31.5%.

Figure 48. Evolution of L96M EPS error variance for (a) mean error of ensemble
mean and (b) fractional error in ensemble spread. The error variances are shown
following calibration to remove systematic error.

Figure 49. Average total ambiguity of the EoE ambiguity distributions for test
forecast probability values 5% (o), 50% (*) and 95% (x).

Figure 50. Arrangement of EoE constituents at a (a) high and (b) low ambiguity
timeframe. The PDFs for 100 constituents in a single EoE forecast case are
displayed using a normal fit (solid lines) for (a) τ = 0.2 and (b) τ = 4.8
time units. An arbitrary event threshold (dashed line) is also shown for
analysis of forecast probability values for each constituent. Note that in (b)
a different event threshold is used, and abscissa and ordinate scaling has
changed.

Figure 51. Example of forecast probability sensitivity to PDF spread and shifts
in PDF location for low spread (thick solid) and high spread (dot-dash) PDF. In
(a), both PDFs are located at 0.75, and the probability of preceding the event
threshold (thin solid) is 15.9% and 35.4% for the low and high spread PDFs,
respectively. In (b), each PDF is shifted to -0.25 while holding spread
constant, giving probability values of 63.1% and 55% for the low and high
spread PDFs, respectively. Probability for the low spread PDF changed by 47.2%,
while the change was 19.6% for the high spread PDF.

Figure 52. Comparison of average variance between EoE constituent ensemble
forecast mean values (▲) and average variance of EoE constituent ensemble
forecasts (■) with increasing lead time. The comparison was made using 100 EoE
forecast cases each containing 100 constituent ensemble forecasts.

Figure 53. Comparing the average evolution of EoE constituent relationships to
the typical EoE ambiguity evolution using (a) same as Figure 52, (b) the ratio
of average variance in location of EoE constituent ensemble forecasts’ means to
average constituent variance and (c) same as Figure 49.

Figure 54. Comparing the evolution of average L96M ensemble forecast statistics
to the typical EoE ambiguity evolution using (a) the variance of mean error in
the ensemble mean (▲) and average ensemble forecast variance (■) computed from
24,000 L96M forecast cases, (b) the ratio of the variance of the mean error in
the ensemble mean to the average ensemble variance in location and (c) same as
Figure 49.

Figure 55. Ratio of average variance of EoE constituent ensemble forecast means
to the variance of the mean error in the ensemble forecast mean. The average
variance in constituent means is computed using 100 EoE forecast cases each
with 100 constituent forecasts. The mean error is computed using 24,000 L96M
EPS forecast cases, where the variance in mean error is found by computing the
mean error over 3,000 subsets of eight forecasts each and taking the variance.

Figure 56. Validation of CESq (o) and RCR (*) total ambiguity across all
forecast lead times for the specific p* test values (shown in Figure 37,
page 108), which are labeled at the top of each panel.

Figure 57. Validation of CESq (o) and RCR (*) total ambiguity at select
calibrated forecast probability values (p*) (shown in Figure 37, page 108) for
forecast lead times 0.2-5.0 at an increment of 0.2. Lead times (τ) are labeled
at the top of each panel.

Figure 58. Total ambiguity evolution for EoE (+), CES (o), and RCR (*) for
ambiguity distributions with expected value of 50%.
Figure 59. Frequency of uncertain ensemble forecasts (i.e., control ensemble
forecasts with p] between 0.1% and 99.9%) for (a) the common event of X > 6.31
and (b) the rare event of X > 9.98. The ensemble forecast for each variable
from the first constituent of each EoE forecast case was utilized as a control
ensemble forecast for a total of 800 forecasts.

Figure 60. Ambiguity distributions for EoE (solid) and CESq (dashed) with
expected value equal to 50% for a single EoE forecast case at τ = 5 time units
for a single X_k variable. The distributions are approximated using a beta-fit
to the estimated forecast probability values for each technique. The upper (UB)
and lower (LB) bounds of each technique’s 90% CI (i.e., total ambiguity) are
labeled.

Figure 61. Ambiguity distributions for EoE (solid) and CESq (dashed) with
expected value equal to 5%. Same as Figure 60.

Figure 62. Comparison of validation of CESq without correction (o) and with
correction (x) applied to the variance of the ME- distribution. The correction
is based on the ratio of variance in EoE constituents’ location to variance in
ME- (Figure 55).

Figure 63. Integrated optimal value score [IOVS, Equation (27), page 72] for
the calibrated control ensemble forecast (solid) and the deterministic forecast
(dashed) for (a) the common event and (b) the rare event.

Figure 64. Relative integrated optimal value score [IOVS, Equation (27),
page 72] using uncertainty-folding with EoE (dashed), CESq (dotted) and RCR
(dot-dashed) for (a) the common event and (b) the rare event. The score for the
grand ensemble (solid) is also shown in both panels. Error bars represent the
95% CI found using resampling. Note the ordinate scale change between (a) and
(b).

Figure 65. Control forecast probability well located with respect to the
expected value of the EoE ambiguity distribution (80%). A histogram of p_T
values for a single EoE forecast case (100 constituents) is shown with a
Beta-fit curve for the RCR ambiguity distribution (solid line) created using
the first constituent in the EoE forecast case as the control forecast. The
control forecast probability (p] = 80%) is marked by the dashed line.

Figure 66. Control forecast probability poorly located with respect to the
expected value of EoE ambiguity distribution. Same as Figure 65 with the
expected value of the EoE ambiguity distribution at 20% and the control
forecast probability at 40%.

Figure 67. Optimal VS comparison for the GFS deterministic forecast (*) versus
the GEFS forecast (o) using the application dataset of 50,220
forecast-observation pairs.

Figure 68. Number of repeat false alarms for the control user at each C/L based on
I. INTRODUCTION


The primary tool for weather forecasters today is the Numerical Weather 
Prediction (NWP) model, and ensembles are rapidly gaining momentum as the preferred 
application. Ensemble forecasts provide an estimation of the uncertainty associated with 
NWP forecasts, but at this time the forecast is typically employed without consideration 
of the uncertainty associated with the ensemble’s prediction of uncertainty. This research 
is focused on exploring methods to objectively quantify the uncertainty in an ensemble 
forecast and determine the value of knowing that information. 

Over many years, the mold has been cast for using NWP models for deterministic 
forecasting, i.e., using a single model forecast to convey the future state of the 
atmosphere. Although great improvements have been made since the birth of NWP (e.g., 
increased computing power, better model physics, finer grid scales and improved 
numerical methods), the deterministic application of NWP still produces forecasts with a 
great deal of uncertainty (Leutbecher and Palmer 2007). It is unavoidable that
even small errors in the initial conditions grow to produce large forecast errors
(Lorenz 1969). Thus, deterministic NWP may not be the most effective approach.
Improvement of the NWP model can provide only finite improvement in forecast quality 
(Brooks and Doswell 1993; Lorenz 1993). Ensemble forecasting was introduced as a 
means of objectively characterizing the uncertainty in NWP forecasts. It involves 
running multiple, parallel models (members), where each member has perturbations to 
the initial conditions and the model. An ideal ensemble prediction system (EPS) includes 
perturbations in the initial conditions that capture all possible errors in the analysis, as 
well as model perturbations representing all possible model errors, which requires an 
infinite number of members. 

Ensemble forecast information has several applications, including predicting
deterministic forecast skill (via the ensemble spread) and improving deterministic
forecast skill (via the ensemble mean) (Eckel 2008). The definitive application of
ensemble forecasts is production of a forecast probability of occurrence for a specific
event (e.g., temperature < 0°C), which can have high value in the decision making process
(Eckel 2008). Numerous studies have shown the value of using probabilistic decision
inputs over using deterministic or climatological information (e.g., Katz and Murphy
1997; Richardson 2000; Palmer 2002; Zhu et al. 2002). The problem that is largely
overlooked at this time is the uncertainty associated with the ensemble forecast itself
(Eckel and Allen 2009). Uncertainty in the ensemble forecast is due to design and
computational restrictions that preclude running an ideal EPS. Today’s EPSs use a finite
number of ensemble members and inadequately represent the uncertainty associated
with the initial conditions and model design. Thus, there is uncertainty in the estimation
of forecast uncertainty, a phenomenon termed ambiguity. Ambiguity has been
considered in formal decision science for many years and is generally studied in the vein
of understanding people’s attitudes towards ambiguity in the decision, or ambiguity
aversion (Ellsberg 1961; Camerer and Weber 1992). In these studies, the
decision-maker’s estimate of the uncertainty is typically subjective (Camerer and Weber 1992;
Wallsten 1990). Application of objectively estimated second-order uncertainty to
optimize decisions was not attempted.
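
The contribution of finite sampling to ambiguity can be illustrated with a small numerical experiment. The following is a minimal sketch only, not one of the estimation techniques evaluated in this research: it assumes a true forecast PDF of N(0,1), a hypothetical event threshold, and a simple member-counting estimate of forecast probability. The function names and parameter values are illustrative assumptions.

```python
import numpy as np


def exceedance_probability(members, threshold):
    """Forecast probability as the fraction of members above the threshold.

    Simple member counting is an assumption made for this sketch.
    """
    members = np.asarray(members, dtype=float)
    return float(np.mean(members > threshold))


def sampling_ambiguity(n_members, threshold, n_trials=5000, seed=0):
    """5th and 95th percentiles of forecast probability across many
    hypothetical perfect-model ensembles drawn from a true N(0, 1) PDF."""
    rng = np.random.default_rng(seed)
    # Each row is one possible realization of an n_members ensemble.
    draws = rng.standard_normal((n_trials, n_members))
    probs = np.mean(draws > threshold, axis=1)
    p05, p95 = np.percentile(probs, [5, 95])
    return float(p05), float(p95)


if __name__ == "__main__":
    for n in (8, 80):
        p05, p95 = sampling_ambiguity(n, threshold=0.5)
        print(f"{n:3d} members: 90% interval width = {p95 - p05:.3f}")
```

Even though every ensemble in this experiment samples the same true PDF, the forecast probability varies from one realization to the next, and the 90% interval narrows as the number of members increases.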

The main objectives of this research are to: (1) understand the mechanisms behind
the evolution of ambiguity associated with an ensemble forecast, (2) validate objective
estimates of ambiguity associated with an EPS, and (3) explore methods of applying the
ambiguity information in order to add value in decision making.

This dissertation is organized into five chapters, including this Introduction. The
Background chapter (Chapter II) provides an overview of basic ensemble forecasting
theory with a more in-depth look at sources of error in the EPS directly relating to
ambiguity. In addition, Chapter II reviews the methods used during this research to
determine the value of the ambiguity information in decision making. Chapter III
provides the Methodology used to accomplish the three research objectives, including the
NWP model and EPS design. Results of the behavior, validation, and value studies are
presented in Chapter IV. Finally, conclusions and future research are addressed in
Chapter V.
II. BACKGROUND


A. ENSEMBLE FORECASTING 

Since the advent of Numerical Weather Prediction (NWP) with the first
successful 24-hour forecasts by Charney and his group in 1949, the primary role of NWP
has been to produce deterministic prognoses of the future state of the atmosphere (Lewis
2005). Over the following decades, the chaotic nature of the atmosphere has come to be
understood by meteorologists, ushering in a new paradigm for atmospheric prediction.
First conceived by Poincaré in 1914 and later proven by Lorenz in his seminal paper in
1963 (Eckel 2008; Lorenz 1963), chaos describes the behavior of nonlinear dynamical
systems. What appears to be randomness in the evolution of the deterministic system is
actually the result of sensitive dependence on initial conditions (ICs). Small errors in the
ICs are evolved according to the system’s (deterministic) rules, and these errors grow
nonlinearly with increasing forecast lead time. Ultimately, the error grows so large that
the forecast is no better than one conceived using past observational data (i.e.,
climatology). At this point, the limit of predictability has been reached.
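
Sensitive dependence on ICs can be demonstrated with the three-variable system of Lorenz (1963). The sketch below is illustrative only and is not the low-order model used in this research: it assumes the classic parameter values (sigma = 10, rho = 28, beta = 8/3), a fourth-order Runge-Kutta integrator, and a hypothetical IC perturbation of 1e-8. All function names are assumptions made for the example.

```python
import numpy as np


def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time tendency of the Lorenz (1963) system (classic parameters assumed)."""
    x, y, z = state
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])


def rk4_step(f, state, dt):
    """Advance the state one time step with fourth-order Runge-Kutta."""
    k1 = f(state)
    k2 = f(state + 0.5 * dt * k1)
    k3 = f(state + 0.5 * dt * k2)
    k4 = f(state + dt * k3)
    return state + dt * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0


def error_growth(perturbation=1e-8, lead_steps=2000, spinup_steps=500, dt=0.01):
    """Distance between a 'truth' run and a run started from perturbed ICs."""
    truth = np.array([1.0, 1.0, 1.0])
    for _ in range(spinup_steps):  # discard the transient before perturbing
        truth = rk4_step(lorenz63, truth, dt)
    forecast = truth + np.array([perturbation, 0.0, 0.0])
    errors = np.empty(lead_steps)
    for i in range(lead_steps):
        truth = rk4_step(lorenz63, truth, dt)
        forecast = rk4_step(lorenz63, forecast, dt)
        errors[i] = np.linalg.norm(forecast - truth)
    return errors


if __name__ == "__main__":
    errors = error_growth()
    for step in (0, 499, 999, 1499, 1999):
        print(f"t = {(step + 1) * 0.01:5.2f}  error = {errors[step]:.3e}")
```

Both runs obey identical deterministic rules; only the ICs differ, yet the separation between them grows by many orders of magnitude over the integration.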

Observations of the current state of the atmosphere cannot accurately represent
the current conditions at all points and on all scales. Thus, even if our NWP models were
perfect, error in the ICs would render the forecasts useless after a short time. As an
added complication, our NWP modeling systems are not perfect in that they cannot
represent atmospheric phenomena on all spatial or temporal scales, forcing modelers to
approximate many subgrid scale, unresolved processes. Thus, even given perfect ICs,
model deficiencies would again result in nonlinear error growth and limit predictability.

Forecasts of the future states must be looked at as uncertain events where there
exists some chance of occurrence (Eckel 2008). The concept of ensemble forecasting
(EF) was first introduced by Leith in 1974 (Leith 1974; Lewis 2005). Leith proposed
using multiple perturbed NWP runs to produce a limited sample of possible future states.
By using the mean value of forecasts from approximately 10 different NWP runs, Leith
was able to show improvements in forecasts with lead times out to 10 days (Sivillo et al.
1997). While requiring large computational resources, EF was seen as a viable method to
estimate forecast uncertainty.

An EF is essentially a group of concurrent NWP forecasts, where each member of
the ensemble is run using slightly different (perturbed) ICs and perturbations to the NWP
model. The purpose of the ensemble forecast is to simulate the error growth associated
with errors in the analysis of the current state and deficiencies in the NWP model, and to
produce a sample of likely forecast states (Eckel 2008). Separating these two error
sources in real-world ensemble prediction systems (EPS) may not be possible, as the
first-guess used during the data assimilation (DA) process to produce the analysis for the
next model run is typically a forecast state from the previous model run. This forecast
state (background) is then updated using observations to nudge it closer to the current
observed state of the atmosphere, ultimately providing an analysis of the current state that
is more precise than either the observations or the background.

Anderson (1996), Eckel (2008), Toth and Kalnay (1993), and Tracton and
Kalnay (1993) describe the basic applications of EF data:

• EF mean accuracy is better on average than deterministic NWP;

• EF spread gives the confidence in a single deterministic NWP model run;

• Solution clusters can aid in narrowing the most likely evolution;

• Forecast probability of occurrence for some event can be calculated from
the distribution of EF members.

Forecast probability is the ultimate product of EF data, since it provides the forecast user
with objective uncertainty information regarding the event in question. The user can then
complete a thorough risk analysis and optimize decision making.
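
Three of the products listed above (EF mean, EF spread, and forecast probability) can be computed directly from the member values. The following is a minimal sketch using hypothetical 2-m temperature values; it estimates forecast probability by simple member counting, which is an assumption made for brevity and is not necessarily the method used elsewhere in this research.

```python
import numpy as np


def ensemble_summary(members, threshold):
    """Return (mean, spread, probability of a value below the threshold).

    Spread is the sample standard deviation of the members. Probability is
    the fraction of members below the threshold (simple counting, assumed
    here for illustration).
    """
    members = np.asarray(members, dtype=float)
    mean = members.mean()
    spread = members.std(ddof=1)
    probability = np.mean(members < threshold)
    return float(mean), float(spread), float(probability)


if __name__ == "__main__":
    # Hypothetical 5-member forecast of 2-m temperature (degrees C).
    temps = [-1.0, 0.5, 1.5, 2.0, 3.0]
    mean, spread, prob = ensemble_summary(temps, threshold=0.0)
    print(f"mean = {mean:.2f} C, spread = {spread:.2f} C, P(T < 0 C) = {prob:.0%}")
```

With so few members, the probability can take only a handful of discrete values, which is one reason practical systems use more refined probability estimates.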

There are many sources of error in NWP within the two general types, analysis
error and model error. An ideal EPS will account for all sources of uncertainty associated
with its modeling system. Any EPS deficiencies (errors not accounted for) result in
errors in the ensemble forecast probability density function (PDF). If the ensemble
forecast PDF is wrong, then measures of uncertainty in the forecast will be incorrect,
including forecast probability. Sources of error in the ensemble forecast PDF include
limited sampling and poor simulation of IC errors and model errors. Model errors can be
associated with the numerical techniques used in NWP as well as inaccuracies or
uncertainty in subgrid scale, unresolved processes due to unrepresentative
parameterizations or inadequate model resolution. In general, error can be separated into
two categories, systematic and stochastic. Systematic error is bias or errors that
consistently repeat. Stochastic error describes the variance of error about systematic
error. Systematic and stochastic errors occur in all moments of the ensemble forecast
PDF.

The following sections describe current state-of-the-art techniques used by 
operational forecast centers to generate IC and/or model perturbations in an EPS, while 
focusing on the limitations of the techniques and thus their contributions to stochastic 
error and ultimately ambiguity. Additionally, different sources of model error, horizontal 
resolution and the implications of limited sampling on ensemble forecasting are 
discussed. 

1. Accounting for Analysis Error—IC Perturbations 

Analysis error is any difference between the estimated and the true state of the 
system at initialization of the NWP model. Analysis error may result from errors in the 
observations due to instrument limitations or the inability to observe at all spatial and 
temporal scales. Additionally, analysis error may come about in data assimilation when 
an erroneous forecast state from a previous model run (i.e., the background) is combined 
with the observations. Also, when the background and observation information are combined, 
error may be introduced through interpolation or variable transformation. An analysis 
may be considered perfect (i.e., all grid point values accurately represent the average 
conditions within the grid box) and still be in error since it cannot represent sub-grid 
scale conditions or the numerical precision of actual atmospheric variable values. 

The model analysis is a hyperdimensional vector containing the values of all state 
variables, where the values describe the instantaneous state of the system in phase space 
(i.e., a region where all state variables are represented by a unique dimension) (Eckel 
2008). A perturbation or change to any state variable results in a change of location of 
the state in phase space, which may be described as a change in the direction of the 
hyperdimensional vector pointing to the instantaneous state from a fixed origin. In 
ensemble forecasting, IC perturbations produce possible analysis states within the 
model’s attractor (i.e., the collection of all naturally occurring states of the model in 
phase space) that are consistent with the analysis error covariance and structured to 
simulate the fastest error growth based on the analyzed state of the system, for example, 
perturbing the location of a baroclinic zone. Thus the goal is to produce perturbed ICs 
that are equally likely estimates of the true state that cover all scales of motion and lead 
to accurate simulation of error growth associated with the current flow (Eckel 2008). 

Properly representing the sources of uncertainty relevant to the current flow is an 
important aspect of EF. Purely random ICs used with a finite member EPS will likely not 
adequately represent the analysis uncertainty, as error growth associated with many 
members will be too slow or even decrease early in the forecast (Magnusson et al. 2008). 
The generation of ICs for EF is intended to provide a range of analysis states that allow 
the EF solution to adequately disperse given the current uncertainty in the analyzed state. 
Given a perfect model, the n members’ states of the EF should encompass the true 
forecast state at some later lead time at a rate of (n-1)/(n+1) (Eckel 2008). Several 
varying techniques are currently in use at the major operational forecast centers, but these 
techniques can be divided into two categories (Leutbecher and Palmer 2008, hereafter 
LP08). 
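
The (n-1)/(n+1) rate quoted above can be checked with a small Monte Carlo sketch. It 
assumes a scalar forecast variable and a perfect ensemble in which the truth and the n 
members are independent draws from the same distribution (a standard normal is used 
here purely for convenience); the function name and trial count are hypothetical. 

```python
import random


def encompass_rate(n_members, n_trials=20000, seed=0):
    """Fraction of trials in which the truth lies inside the ensemble range."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(n_trials):
        truth = rng.gauss(0.0, 1.0)
        members = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        if min(members) < truth < max(members):
            inside += 1
    return inside / n_trials


n = 8
simulated = encompass_rate(n)
expected = (n - 1) / (n + 1)  # theoretical rate for a perfect ensemble
```

For eight members the theoretical rate is 7/9, so even an otherwise ideal ensemble of 
this size should miss the truth in roughly two of every nine forecasts. 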

The first category is described by LP08 as techniques designed to produce 
perturbed ICs using ensemble-based DA, such as the Ensemble Kalman Filter (EnKF) 
with perturbed observations. In this method, employed by the Canadian Meteorological 
Service, multiple DA cycles are performed using observations perturbed by random noise 
simulating observational error (LP08). EnKF produces an ensemble of analysis states 
that can be used as EF ICs, where the ensemble of perturbed analyses is created by 
optimally combining the perturbed observations with an ensemble of perturbed forecasts. 
Also, the mean of the EnKF member states may be used as the best-guess analysis from 
which to start a single NWP model run. EnKF is discussed in more detail in Chapter 
III.A.3. A limiting factor for the EnKF is estimation of background error covariance used 
when updating the ensemble of perturbed forecasts. An EnKF ensemble that is too small 
may result in spurious, unrealistic correlations between locations in the model domain 
giving a noisy estimate of the background error covariance (Hamill et al. 2001; Lorenc 
2003). Also, small ensemble sizes may result in background error covariance estimates 
that are too small leading to non-optimal estimates of the Kalman gain (Hamill et al. 
2001; Lorenc 2003). Covariance estimate errors may also be introduced by errors in the 
NWP model used to integrate the EnKF members. The background error covariance 
problems described and/or a lack of quality, timely observations may lead the EnKF 
analyses to drift away from the true state of the system resulting in large analysis error. 
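
The perturbed-observation update described above can be sketched for the simplest 
possible case: a single state variable that is observed directly. This is an illustrative 
assumption, not the configuration of any operational system; the function name, 
ensemble size, and error variances are hypothetical. 

```python
import random
import statistics


def enkf_update(background, obs, obs_var, rng):
    """Scalar EnKF analysis step with perturbed observations."""
    # Background error variance estimated from the ensemble itself.
    p_b = statistics.variance(background)
    gain = p_b / (p_b + obs_var)  # scalar Kalman gain
    analysis = []
    for x_b in background:
        # Each member assimilates its own noisy copy of the observation.
        perturbed_obs = obs + rng.gauss(0.0, obs_var ** 0.5)
        analysis.append(x_b + gain * (perturbed_obs - x_b))
    return analysis, gain


rng = random.Random(0)
background = [rng.gauss(2.0, 1.0) for _ in range(200)]
analysis, gain = enkf_update(background, obs=0.0, obs_var=1.0, rng=rng)
```

Because the gain depends on the sample variance of the background members, a small 
ensemble yields a noisy gain, which is the scalar analogue of the covariance estimation 
problem noted above. 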

The UK Meteorological Office (UKMO) uses a technique called the Ensemble 
Transform Kalman Filter (ETKF). The ETKF uses a transformation matrix to transform 
an ensemble of perturbed forecast states into an ensemble of perturbed analysis states 
(Wang and Bishop 2003). The transformation matrix rotates and scales the forecast 
perturbations based on observational information producing orthogonal analysis 
perturbations exhibiting variance that satisfies the Kalman filter error covariance update 
equation (Wang and Bishop 2003; Wei et al. 2006). The formulation of the ETKF does 
not allow it to be used to produce a best-guess analysis, so it must be used in conjunction 
with some DA technique (LP08; Wang and Bishop 2003). In this case, the background 
error covariance matrix used in DA will not strictly match the covariance matrix 
developed using the ensemble leading to errors in the estimate of the analysis error 
covariance since the ETKF assumes the matrices match (Wei et al. 2006). Like the 
EnKF, this perturbation generation method is sensitive to the ensemble size, thus 
covariance inflation is necessary to prevent underestimation of analysis error covariances 
for small ensembles (Wei et al. 2006). Underestimation of the analysis error covariance 
is also possible if model error is neglected. Importantly, the transformation matrix and 
the inflation factor as discussed by Wang and Bishop (2003) and Wei et al. (2006) are 
sensitive to the spatial and temporal variability of the observation network used. Routine 
changes in the observation density in an operational observation network can greatly 
affect the accuracy of the ETKF. 

The second category of techniques includes those that attempt to select 
perturbations capturing the greatest error growth over some forecast period. According 
to LP08, the techniques “selectively sample initial uncertainty only in directions that are 
dynamically most important for determining ensemble dispersion.” It is assumed that the 
growing modes found will continue to show the largest error growth beyond the forecast 
period used for selection. The bred vectors (BV) method, currently used in the National 
Centers for Environmental Prediction’s (NCEP) short-range EF (SREF), is in this 
category. In BV, a random perturbation is applied to an initial state, and both the 
perturbed and original states are evolved forward using the NWP model over some 
forecast period. At the end of the forecast period, the vector difference between the 
perturbed and original states is found. This difference vector is rescaled to match the 
typical analysis error magnitude and then used to perturb a new initial state. After several 
repeated cycles, the final perturbation direction is found. This process is repeated using 
several random perturbations to find multiple final perturbations that are used as the 
ensemble ICs (LP08). The BV method is limited by the fact that it attempts to find only 
the perturbations responsible for the greatest error growth, whereas other perturbation 
directions may also be important (Eckel 2008). In addition, the perturbation rescaling 
process can introduce errors, thus a regional or variable dependent rescaling may be 
necessary. Rescaling only certain variables that exceed a global analysis error value 
changes the direction of the hyperdimensional state vector describing the system and 
changes the direction of the perturbation found using BV (Eckel 2008). 
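
The breeding cycle described above can be sketched on a toy system. The Lorenz (1963) 
equations are used here only as a convenient stand-in for the NWP model; that choice, the 
Runge-Kutta integrator, the perturbation size, and the cycle length are all assumptions for 
illustration. In practice each cycle would restart from a new analysis, whereas this sketch 
simply continues from the evolved control state. 

```python
import math
import random


def lorenz63(state, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Time tendency of the Lorenz (1963) system."""
    x, y, z = state
    return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)


def step(state, dt=0.01):
    """Advance one time step with fourth-order Runge-Kutta."""
    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))
    k1 = lorenz63(state)
    k2 = lorenz63(shifted(state, k1, dt / 2))
    k3 = lorenz63(shifted(state, k2, dt / 2))
    k4 = lorenz63(shifted(state, k3, dt))
    return tuple(s + dt / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def norm(vector):
    return math.sqrt(sum(c * c for c in vector))


def breed(control, size=0.1, cycles=20, steps_per_cycle=8, seed=0):
    """Return the evolved control state and the final bred perturbation."""
    rng = random.Random(seed)
    pert = [rng.gauss(0.0, 1.0) for _ in control]  # random initial perturbation
    pert = [size / norm(pert) * p for p in pert]
    for _ in range(cycles):
        perturbed = tuple(c + p for c, p in zip(control, pert))
        for _ in range(steps_per_cycle):
            control = step(control)
            perturbed = step(perturbed)
        diff = [p - c for p, c in zip(perturbed, control)]
        # Rescale the difference to the assumed analysis error magnitude.
        pert = [size / norm(diff) * d for d in diff]
    return control, pert


final_state, bred_vector = breed((1.0, 1.0, 20.0))
```

Repeating the call with different seeds gives several bred perturbations, although in a 
three-variable system they tend to align with the same fast-growing direction, which 
illustrates the limited diversity noted above. 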

The European Centre for Medium-Range Weather Forecasts (ECMWF) employs 
a technique that falls into the second category termed singular vectors (SV). The SV 
method finds the leading singular vectors or directions of maximum growth based on a 
linear version of the NWP model over some optimization period, typically taken as 48 
hours for ECMWF ensemble forecasts (LP08). In other words, SV determines the 
directions of initial uncertainty that lead to the largest forecast uncertainty dynamically 
constrained by the NWP model (LP08). SV are sensitive to the choices made for the 
length of the optimization period and the norm used to evaluate the magnitude of the 
vector (e.g., Euclidean norm or total energy norm) (Kalnay 2003). Thus very different 
SV may result from varying these two parameters. Another limitation of SV is the 
assumption of linear error growth over the optimization period (ECMWF 2009) requiring 
the use of a tangent-linear version and adjoint of the full, nonlinear NWP model. The 
tangent-linear model employed at ECMWF uses a simplified physics package without 
physical parameterizations (except for simple vertical mixing and friction), which may 
also result in suboptimal SV perturbations due to model deficiencies (LP08). Magnusson 
et al. (2008) found that SV are best for shorter time-scale forecasts, as their effectiveness 
degrades at longer time scales. Presumably, at longer lead times the SV associated with 
maximum error growth are different than those calculated over the optimization period. 
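
The idea of a leading singular vector can be illustrated with a toy linear propagator. The 
2x2 matrix below is a hypothetical stand-in for a tangent-linear model integrated over the 
optimization period, and the Euclidean norm is assumed at both initial and final times; 
neither reflects the ECMWF configuration. Power iteration on M^T M is used so that the 
sketch needs only the standard library. 

```python
import math


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


def leading_singular_vector(matrix, iterations=200):
    """Return (initial direction, growth factor) of maximum linear growth."""
    adjoint = transpose(matrix)  # plays the role of the adjoint model
    v = [1.0] * len(matrix[0])
    for _ in range(iterations):
        w = matvec(adjoint, matvec(matrix, v))
        length = math.sqrt(sum(c * c for c in w))
        v = [c / length for c in w]
    evolved = matvec(matrix, v)
    growth = math.sqrt(sum(c * c for c in evolved))
    return v, growth


# Hypothetical propagator: both eigenvalues (0.9 and 0.8) are below one.
propagator = [[0.9, 2.0], [0.0, 0.8]]
direction, growth = leading_singular_vector(propagator)
```

Although every eigenmode of this propagator decays, the leading singular vector grows 
by a factor of about 2.3 over the period. Changing the period (the matrix) or the norm 
changes the resulting direction, which is the sensitivity noted above. 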

Comparison studies performed to determine if one method or category of IC 
generation techniques is superior have had mixed results. Using current operationally 
produced data, it is difficult to separate the techniques from the numerical models they 
are applied to, which are of varying quality, thus no conclusive results have been found 
(LP08). In idealized studies, more interesting and informative comparisons between the 
categories have been achieved. Houtekamer and Derome (1995) found that the 
techniques in each of the categories produced equally skillful ensemble mean forecasts. 
Hamill et al. (2000) found the ensemble-based methods had superior statistical 
consistency (defined by Anderson 1996, 1997 and Talagrand et al. 1997), mainly early in 
the forecast period. In a more recent study, Descamps and Talagrand (2007) analyzed the 
skill and statistical consistency of ensemble forecasts made using a model of a low-order 
dynamical system as well as a quasi-geostrophic model in a perfect model context using 
EnKF, ETKF, BV and SV initial conditions. They found the skill of the ensemble mean 
was significantly higher for the EnKF and ETKF forecasts. Statistical consistency and 
other forecast skill and quality tests (i.e., Brier score and relative operating characteristic) 
also showed significant improvement when using the ensemble-based methods. These 
results confirmed tests by Bowler (2006) who found EnKF outperformed SV and BV in 
an EPS based on the same low-order model used by Descamps and Talagrand. 

2. Accounting for Model Error—Model Perturbations 


Model error is any difference between the model attractor and the true 
atmospheric attractor resulting from the design of the NWP model, including limits in 
model resolution, mathematical formulation, physics, and lateral and surface boundary 
conditions. For example, parameterizations are used within the NWP model to account 
for the effects of subgrid scale, unresolved processes on the forecast evolution. In some 
cases, the parameterized process may not be well understood or the availability of 
observational studies used to develop or train the parameterization may be limited. In 
these cases, forecast uncertainty may be high when the forecast trajectory is sensitive to 
the parameterization errors. The aim of perturbing the model is to introduce equally 
likely perturbations that represent probable model error covering all scales of motion 
(Eckel 2008), thus providing model diversity during the ensemble forecast that 
adequately represents the current flow’s sensitivity to the model error. 

Although a majority of the research into proper perturbations for an EPS has been 
focused on generation of ICs, the significance of model deficiencies to uncertainty in the 
ensemble forecast cannot be overlooked. Accounting for model error using one of the 
techniques described below can increase dispersion and improve overall skill particularly 
for surface, sensible weather phenomena of concern to users (Eckel 2003; Mylne et al. 
2002). EPSs that do not account for model error are necessarily under-dispersive, as the 
model attractor does not mimic the true system attractor. 

Descamps and Talagrand (2007) expanded their study of the quasi-geostrophic 
EPS to include model error. Once again, they found the ensemble-based IC perturbation 
techniques performed the best, but the skill and consistency of all methods were reduced 
by the introduction of model error. Their results also indicated that the gains made by 
using the ensemble-based IC generation methods did not last as long into the forecast 
period when model error was introduced. The authors explain this result as a 
consequence of rapidly growing transient instabilities (errors) in the flow generated early 
in the forecast. Thus at later lead times, deficiencies in the underlying forecast model 
may rapidly dominate over the quality of the initial conditions with regard to forecast 
uncertainty. Therefore, uncertainty in the forecast due to model deficiencies must be 
accounted for. 


a. Basic Techniques 

There are three basic techniques used to account for model error in an 
EPS. The first technique is called stochastic-physics and is used at ECMWF. Buizza et 
al. (1999) describe stochastic-physics as randomly perturbing the tendency of the state 
variables during integration with “some appropriate degree of spatio-temporal 
autocorrelation.” The state variables are perturbed in an attempt to capture the influence 
of parameterization errors. Studies by Evans et al. (2000), Ziehmann (2000), and 
Richardson (2001) indicate this technique has limited effectiveness, likely due to each 
ensemble member using the same model attractor resulting in limited diversity. Random 
changes to the state variables move a member’s trajectory off of the attractor, but it then 
converges back rapidly (Eckel 2003). Backscatter is another stochastic method used to 
account for unrepresented dynamical processes in the NWP model, where energy at 
subgrid scales is excited and transferred up-scale to resolved scales (Shutts 2005). Shutts 
provides support for the argument that energy dissipation in NWP models is excessive, 
thus arguing the need for local up-scale kinetic energy transfer. He found an 
improvement in ensemble spread, consistency and skill using an EPS with 
stochastic-physics and stochastic backscatter, while acknowledging the stochastic-physics 
contribution to increased spread was “small but consistently positive.” Backscatter is 
limited by our ability to accurately estimate atmospheric energy dissipation and local 
up-scale energy transfer (Shutts 2005), which naturally leads to errors when exciting energy 
transfer in the NWP model. 
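
A minimal sketch of a tendency perturbation with temporal autocorrelation follows. It 
assumes a multiplicative perturbation driven by first-order autoregressive (AR(1)) noise 
and a scalar relaxation equation standing in for a parameterized tendency; these choices, 
and all names and parameter values, are hypothetical and are not the ECMWF scheme. 

```python
import random


def ar1_multipliers(n_steps, phi=0.9, sigma=0.25, seed=0):
    """Multipliers 1 + r_t, where r_t is AR(1) noise with lag-1
    autocorrelation phi and stationary standard deviation sigma."""
    rng = random.Random(seed)
    innovation_sd = sigma * (1.0 - phi * phi) ** 0.5
    r = rng.gauss(0.0, sigma)
    multipliers = []
    for _ in range(n_steps):
        multipliers.append(1.0 + r)
        r = phi * r + rng.gauss(0.0, innovation_sd)
    return multipliers


def integrate(x0, multipliers, dt=0.1, damping=0.5, forcing=1.0):
    """Forward-Euler integration with a randomly scaled tendency."""
    x = x0
    for m in multipliers:
        tendency = forcing - damping * x  # stand-in parameterized tendency
        x += dt * m * tendency
    return x


# Each ensemble member uses its own noise sequence (different seed).
members = [integrate(0.0, ar1_multipliers(50, seed=s)) for s in range(8)]
```

Because every member relaxes toward the same equilibrium, the spread introduced by 
the perturbations shrinks as the integration proceeds, which is consistent with the limited 
diversity reported for this technique. 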

The next model perturbation technique is termed perturbed model. In this 
method, a single NWP model is used, but parameterizations within the model are 
perturbed for each member. Understanding the uncertainty in the model associated with 
the parameterizations is a difficult question. Model parameterizations are perturbed 
within some estimate of the parameter uncertainty with the assumption that the correct 
mean tendency can be achieved from one of the perturbations (LP08). In this way, it is 
assumed that the distribution of possible forecast states will encompass the true forecast 
state. However, each member shares many of the same model design features (i.e., the 
model core) and may not adequately reflect the uncertainty in the current flow (Eckel 
2003). Using a stochastic (randomly perturbed) parameterization in a low-order model, 
Wilks (2005) found the stochastic parameterization outperformed a deterministic 
parameterization in representing the climatology of the true system, ensemble mean 
performance, and ensemble dispersion. However, the perturbed model EPS is limited by 
our understanding of the parameterized processes and thus our estimate of the associated 
uncertainty and the sensitivity of the forecast to errors in a given parameterization. Even 
if a parameterization is perturbed accurately, model error is inevitable since the single 
parameter value is used to represent a continuous spectrum of possible true values for a 
single model grid box. 

The final approach used to account for model error is the multi-model 
technique, where each ensemble member is based on a different NWP model or model 
configuration. For example, two members of a multi-model EPS can be the NCEP and 
ECMWF control forecasts, or they may both be from the same model where a different 
convective parameterization is used in each. The different models will generally have 
different numerical schemes, physics schemes and parameterizations. In this way, each 
member model has a distinct attractor, increasing ensemble dispersion. It has been shown 
that differences in skill among members for a given forecast are not a problem, as they also 
add diversity to the distribution of forecast solutions. The primary assumption is that on 
any given day, any of the ensemble members has the potential to outperform the others. 
Thus a model that consistently exhibits low individual skill may not add skill to the 
ensemble forecast. Mylne et al. (2002) found that a multi-model EPS improves skill, 
while Ebert (2001) showed that multi-model ensembles were less likely to suffer from 
under-dispersion due to systematic errors. Multi-model ensembles are limited by the 
availability of different NWP models or by computational restrictions that do not allow 
all possible combinations of model configurations. When using a multi-model ensemble 
where several members are designed around a single model core, similarities between the 
ensemble members will reduce model diversity. 

Error in the NWP model can come from many different sources. The 
model perturbation techniques described approach the sources of model error from 
different perspectives in an attempt to simulate forecast uncertainty. Thus, an ideal EPS 
should use all of these methods in conjunction with one another to achieve the greatest 
model diversity and the most accurate estimate of forecast uncertainty. Additional 
sources of NWP model error that must be accounted for are described in the following 
sections. 


b. Boundary Conditions 

Model boundary conditions are a significant source of model error that is 
normally accounted for separately (from the above basic techniques) in an ensemble. 
This source of error includes the handling of lateral boundary conditions (LBC) for a 
limited-area model (LAM) as well as the surface and upper boundaries of any NWP 
model. A LAM requires the use of LBC updates during model integration, generally 
supplied by a global NWP model, to transfer information from outside the LAM across 
the boundary. An EPS based on a LAM must perturb the LBCs to capture uncertainty 
flowing across the boundary into the LAM domain. Studies described in Nutter et al. 
(2004a and 2004b) indicated that LAMs showed greater sensitivity to changes in LBCs 
than in ICs, and that a SREF using perturbed LBCs produced forecasts with improved 
dispersion. LBCs may be taken from the members of a single- or multi-model, global 
EPS, where the differences between the global LBCs provide the perturbations. 
Perturbations to LBCs are limited by the coarse spatial and temporal availability of global 
information used to update the boundaries, thus missing mesoscale variability (Eckel 
2008). Nutter et al. showed how the limitations may be mitigated using dynamically 
consistent, finescale random perturbations mimicking error growth at every time step 
between LBC updates, but there are no operational centers currently applying this 
technique. 

The surface boundary is another aspect of NWP modeling that plays a 
significant role in producing model error. Surface boundary parameters, such as soil 
moisture, soil type, vegetation type and fraction, and snow cover for example, impact the 
evolution of the atmosphere. While these variables are continuous in nature, they are 
accounted for in the model using two-dimensional surface fields providing representative 
values on the NWP model grid (e.g., using seasonally-based land use tables). Thus the 
surface boundary parameters are a source of random error since they cannot accurately 
represent conditions at all scales and sensitivity to parameter errors is unknown for any 
given forecast. Another significant source of error at the surface boundary is the sea 
surface temperature (SST) field used in the NWP model. Most operational NWP 
modeling systems are not coupled to an ocean model and use only a static SST analysis 
throughout the forecast (Kalnay 2003). Perturbations to the surface boundary parameter 
fields within the estimated uncertainty may be used in an EPS to account for sensitivities 
associated with variations (Eckel and Mass 2005). The methods used to account for 
uncertainty in the formulation of surface boundary parameters are limited, as many of the 
surface processes taking place may not be well understood or well observed (spatially 
and temporally). Initializing and estimating the uncertainty associated with these 
processes in order to perturb them properly is difficult, and incomplete parameter field 
tables may potentially omit significant parameters. 

Model error can also be produced by interactions at the model’s upper 
boundary, where assumptions must be made regarding the evolution of conditions above 
the model’s top during integration. In addition, the rigid lid or constant pressure surface 
employed by most NWP models results in gravity wave reflection, which can impact 
forecast conditions throughout the depth of the model. The gravity wave effects may be 
mitigated by absorption, damping or other techniques at the upper boundary. At this 
time, no operational EPSs consider the uncertainty associated with upper boundary 
conditions or interactions at the boundary during the forecast. 


c. Horizontal Resolution 

Another critical source of model error is the effect of subgrid scale or 
unresolved dynamical processes. Grid point models can adequately resolve features on 
the scale of 7-8 grid points (Kalnay 2003), which leads to many dynamical processes 
taking place below the resolution of the model. These unresolved and therefore 
unaccounted for processes add uncertainty to the forecast, resulting in errors in the 
forecast PDF and under-dispersion in that the range of forecast states cannot reach the 
range of possible true states. Although the stochastic physics techniques described 
previously (Chapter II.A.2.a) attempt to account for these errors, they have been shown to 
produce marginal improvements and cannot fully simulate the subgrid scale uncertainty. 
Additionally, the model attractor will never exactly mimic the atmospheric attractor due 
to its limited dimensionality (i.e., resolution). Thus any given model state, where the 
value of the state variables at each grid point are the mean values for the grid box, may 
actually map to many different true atmospheric states creating model error. 

Increasing horizontal resolution has been shown to improve the skill of the 
ensemble mean (Szunyogh and Toth 2002), thus providing a forecast PDF with reduced 
random error in location. An ensemble study conducted by Mullen and Buizza (2002) 
centered on the impacts of horizontal resolution and ensemble size on precipitation 
forecasts found a higher-resolution model performed better than a lower-resolution model 
for multiple consistency and skill measures (e.g. rank histograms, BSS, ROC). Their 
findings also indicated that using a lower resolution model while increasing ensemble 
size can outperform an EPS using higher resolution and fewer members, especially when 
forecasting rare events. However, given an equal number of members, the higher- 
resolution EPS will perform better. 

Toth et al. (2002) assert the true value of increasing horizontal resolution 
is found when applied to ensemble forecasting. Their study showed improved ensemble 
mean and probabilistic forecasts for 500 hPa geopotential heights in the northern and 
southern hemisphere extratropics. Noise associated with small-scale features that have 
lost predictability interacting with larger-scale, predictable features realistically represents 
natural processes and improves the ensemble’s performance. 


3. Limited Sampling 

Computational constraints on an operational EPS force forecast centers to limit 
the number of ensemble members to ensure timely delivery of forecast products. An 
ensemble with few members cannot consistently reproduce the forecast PDF from which 
they are drawn (Figure 1). The mean and spread error for any one case due to limited 
sampling cannot be known a priori. Sampling distributions of error in the ensemble 
mean and error in the ensemble spread based on random draws from an N(0,1) 
distribution for different ensemble sizes show that error can vary greatly, especially for 
small ensembles (Figure 2). For both distributions in Figure 2, the potential error in the 
statistics decreases with increasing ensemble size, indicating that increased sampling 
provides a better estimate of the forecast PDF (Wilks 2006). Error in the estimated PDF 
due to limited sampling decreases rapidly at first and then ever more slowly with 
increasing ensemble size (for independent draws, the standard error of the mean falls as 
1/sqrt(n)). This diminishing return naturally leads to a leveling off of improvements to 
skill, where the added benefit may no longer justify the additional expense of adding more 
members. A 
similar effect is seen when comparing the skill of ensemble probability forecasts. 
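
The sampling experiment described above can be reproduced in outline. The sketch 
draws many ensembles of a given size from N(0,1) and summarizes how much the 
ensemble mean and ensemble standard deviation scatter about their true values of 0 and 
1. The function name, trial count, and ensemble sizes are hypothetical, and the output is 
not intended to reproduce Figure 2 exactly. 

```python
import random
import statistics


def sampling_error_scatter(n_members, n_trials=5000, seed=0):
    """Standard deviation of the error in ensemble mean and ensemble
    spread when n_members are drawn at random from N(0, 1)."""
    rng = random.Random(seed)
    mean_errors, spread_errors = [], []
    for _ in range(n_trials):
        members = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        mean_errors.append(statistics.fmean(members) - 0.0)
        spread_errors.append(statistics.stdev(members) - 1.0)
    return statistics.pstdev(mean_errors), statistics.pstdev(spread_errors)


scatter_8 = sampling_error_scatter(8)
scatter_32 = sampling_error_scatter(32)
```

Quadrupling the ensemble from 8 to 32 members roughly halves the scatter of the mean 
error, so each additional member buys less improvement than the one before it. 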

While the techniques described for perturbing the initial conditions and the NWP 
model in an EPS are sophisticated compared to purely random perturbations, they are still 
limited in their ability to fully cover the spectrum of possible error sources associated 
with an NWP modeling system. Even if an EPS were perturbed perfectly, limited 
sampling would generate random error in the forecast PDF. The inescapable existence of 
random or seemingly random error in the ensemble forecast means that ambiguity is 
inevitable in predictions of forecast uncertainty. 

B. AMBIGUITY 

In general terms, ambiguity, or second-order uncertainty, can be described as the 
uncertainty associated with estimates of uncertainty (NRC 2006). Camerer and Weber 
(1992) defined ambiguity as “uncertainty about probability, created by missing 
information that is relevant and could be known.” Ensemble-based forecast probability 
provides uncertainty information regarding the future state of a system, specific to an 
event criterion. Ambiguity is therefore the uncertainty surrounding the forecast 
probability (NRC 2006; Eckel and Allen 2009), which can be described by a distribution 
of forecast probability values, referred to here as an ambiguity distribution. 

Ambiguity may be found in ensemble forecasts that have limited sampling and 
insufficient simulation of sources of forecast uncertainty, i.e., the relevant, missing 
information. Ultimately, this leads to an inability of the ensemble to consistently 
reproduce or represent the true forecast PDF. The true forecast PDF is defined as the 
aggregate of all possible atmospheric states given a particular analysis using a specific 
NWP modeling system (Eckel and Allen 2009). Therefore, the true forecast PDF is 
specific to the EPS’s underlying modeling system. Assume we have an infinite record of 
model analyses along with the resulting forecasts and atmospheric observations, where 
neither the model nor the atmospheric system has changed. To determine the true 
forecast PDF for a specific forecast lead time, we first search the record for all previous 
model forecasts matching the current model forecast. The analyses used to initialize each 
match will be the same, but their associated true states will be unique due to analysis 
error, thus the true state at each forecast lead time for each matching forecast will be 
unique. The true forecast PDF at the desired lead is then the combination of all verifying 
observations for the matching forecasts. 

Limited sampling plays a large role in creating ambiguity. As Figure 1 illustrates, an EPS 
with finite members cannot consistently represent the distribution from which the 
members are drawn, even if the EPS is otherwise ideal. The ensemble PDF may be 
calibrated to provide a reliable forecast on average, but the error for any specific case 
cannot be known beforehand, which results in random error in the forecast PDF’s 
moments (Figure 2). This random error exists even for large ensembles, thus sampling is 
a persistent source of random error and ambiguity in the ensemble forecast. 
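
How limited sampling alone produces a distribution of forecast probability values can be 
sketched in the spirit of Figure 1. Each trial draws an ensemble from a true forecast PDF 
of N(0,1), fits a normal PDF to the members, and computes the probability of exceeding 
an event threshold. The threshold of 1.0, the function name, and the trial count are 
hypothetical, and this is not one of the estimation methods evaluated in this research. 

```python
import random
import statistics


def ambiguity_sample(n_members, threshold, n_trials=2000, seed=0):
    """Forecast probabilities from repeated n-member draws of N(0, 1)."""
    rng = random.Random(seed)
    probabilities = []
    for _ in range(n_trials):
        members = [rng.gauss(0.0, 1.0) for _ in range(n_members)]
        fitted = statistics.NormalDist.from_samples(members)
        probabilities.append(1.0 - fitted.cdf(threshold))
    return probabilities


true_probability = 1.0 - statistics.NormalDist(0.0, 1.0).cdf(1.0)
probs_8 = ambiguity_sample(8, threshold=1.0)
probs_32 = ambiguity_sample(32, threshold=1.0)
spread_8 = statistics.pstdev(probs_8)
spread_32 = statistics.pstdev(probs_32)
```

The true probability is about 0.16, yet the eight-member values scatter widely around it. 
The scatter narrows for 32 members but does not vanish. 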

A non-ideal EPS misses simulation of some aspects of IC and model uncertainty, 
resulting in an ensemble forecast PDF with a variable and unknown ability to represent 
the true forecast PDF thus yielding ambiguity. Imagine an ideal EPS except that it is 
designed with a single convective parameterization used for all members (i.e., uncertainty 
in modeling of convection is not represented). The ensemble PDF will be a close 
approximation to the true PDF if the error in the parameter value is low, or the sensitivity 
of the forecast to the parameter error is low, i.e., when convection is not present (Eckel 
and Allen 2009). The ensemble PDF may be a poor approximation when either 
parameter error or sensitivity is high. The variable representativeness of the forecast PDF 
for any one case cannot be predetermined, thus error in the forecast PDF appears random 
(Eckel and Allen 2009). 

In this research, when considering ambiguity associated with ensemble forecast 
probabilities, systematic errors (bias) in the first two moments (mean and variance) of the 
PDF are removed through calibration leaving only the stochastic or random error. 
Although error may be present in higher moments, it is assumed that errors in the first 
two moments have the largest role in creating ambiguity. This may be explained by 
considering changes in probability density associated with changes to different moments 
of the forecast PDF. A change in the first moment (i.e., location) of the PDF results in a 
large shift of probability density and thus a relatively large change in forecast probability 
(depending on the placement of the event threshold within the PDF). While generally not 
as large as changes associated with the first moment, decreasing or increasing the 
variance (i.e., second moment) of the forecast PDF also has the potential to create a 
significant change in probability density, and therefore forecast probability. The ability 
to significantly adjust probability density relative to a given event threshold decreases 
with higher moments of the forecast PDF. 

Camerer and Weber (1992) posit that uncertainty (first-order) and ambiguity 
(second-order uncertainty) are fundamentally different concepts. The magnitude of 
first-order uncertainty that can be measured as variance in the ensemble forecast PDF can be 
independent of the magnitude of ambiguity (i.e., misrepresentation of uncertainty in the 
ensemble forecast PDF due to random errors in the mean and/or variance). For instance, 
a well-designed EPS (i.e., well-sampled, well-perturbed) that is based on a poor NWP 
modeling system would produce forecasts with large uncertainty but very little ambiguity 
(Eckel and Allen 2009). Conversely, a poorly-designed EPS based on a highly skilled 
NWP model would produce forecasts with low uncertainty and high ambiguity. 
However, Eckel and Allen (2009) assert that “larger uncertainty and/or more diversity in 
its sources may increase the opportunity for ensemble deficiencies, which can create 
ambiguity,” thus correlating forecast uncertainty and ambiguity. Further evidence to 
support a relationship between ambiguity and the uncertainty in the current forecast is 
presented in the results section of this dissertation. 

C. FORECAST VALUE 

The primary value of weather forecasts to users is the better consequences 
(economic or other benefits) realized from using the information in the decision making 
process (Zhu et al. 2002). Any new source of weather forecast information should add 
value to the decision maker. Value is added when the information allows the user to take 
actions that improve overall, long-term average consequences over many decision 
opportunities. In this research, we are concerned with the impact of introducing the 
ambiguity information into the user’s decision making process. If users cannot 
effectively use the ambiguity information to add value, then the information holds merely 
entertainment value at best or confuses the user and detracts from optimal decision 
making at worst. 

The analysis of value will be performed in the simple cost-loss (C/L) ratio 
scenario (Murphy 1985; Katz and Murphy 1997; Jolliffe and Stephenson 2003). In the 
basic C/L scenario, the user will either decide to take protective action to mitigate the 
effects of some weather event or take no protective action based on the weather input. If 
the user decides to protect, he incurs a cost (C) for taking the protective action, regardless 
of whether or not the weather event occurs. If the user does not protect and the event 
does not occur, he incurs no expense. Otherwise, if the user does not protect and the 
event occurs, he will incur a loss (L). In this research, we assume the protective action is 
sufficient to guard against all loss. The results of the four possible preparation-outcome 
combinations over a forecast-observation dataset can be tallied in a 2x2 contingency 
table as shown in Table 1 (Jolliffe and Stephenson 2003). The expense (E) associated 
with each possible consequence is given in Table 2. 

For a deterministic forecast, the weather input to the decision making process is 
binary, whereas the stochastic forecast provides a probabilistic input. The stochastic 
input (i.e., the forecast probability, $p_e$) is converted to a binary input through application 
of a decision threshold or decision rule that expresses the amount of risk (i.e., the chance 
of getting an undesirable consequence) the user is willing to accept in the forecast of the 
weather event (Jolliffe and Stephenson 2003). In the C/L scenario, the goal is to use a 
decision rule that minimizes the total expense over many forecast cases, or the expected 
total expense. 

The value score (VS), introduced by Richardson (2000), is a measure of the value 
of weather forecasts that can be explored through the C/L model. Using tallies (a, b, and c 
defined in Table 1) accrued in the contingency table over M forecast-observation 
pairings, it is possible to calculate VS for any C/L ratio ($\alpha$): 

$$VS = \frac{\frac{1}{M}\left(a\alpha + b\alpha + c\right) - \min(\alpha, \bar{o})}{\bar{o}\alpha - \min(\alpha, \bar{o})} \qquad (1)$$

where $\bar{o} = (a + c)/M$ is the sample’s climatological rate of occurrence. In this form, the 
value of the forecast information is calculated assuming that in the absence of a forecast 
decisions will be made based on $\bar{o}$ (i.e., protecting when $\bar{o} > \alpha$). Additionally, 
decisions will be made using the ensemble forecast (i.e., protecting when $p_e > \alpha$). A 
perfect forecast has a VS = 1, while a VS > 0 indicates the forecast system adds value 
compared to following sample climatology. The forecast system has VS < 0 when it 
performs worse than climatology. 
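
As a minimal sketch of Equation (1), the following Python function computes VS from the four contingency tallies. It is an illustration rather than the code used in this research; the function and variable names are assumptions, and expenses are expressed in units of the loss L.

```python
def value_score(a, b, c, d, alpha):
    """Value score from tallies a (hits), b (false alarms), c (misses),
    d (correct rejections) and the user's C/L ratio alpha."""
    m = a + b + c + d                          # number of forecast-observation pairs
    o_bar = (a + c) / m                        # sample climatological rate
    expense_forecast = (a * alpha + b * alpha + c) / m   # mean expense using the forecast
    expense_climo = min(alpha, o_bar)          # always protect or never protect
    expense_perfect = o_bar * alpha            # protect only when the event occurs
    # Undefined (division by zero) when alpha or o_bar equals 0 or 1.
    return (expense_forecast - expense_climo) / (expense_perfect - expense_climo)

print(value_score(a=25, b=10, c=5, d=60, alpha=0.2))
```

A forecast with no false alarms and no misses returns 1, and a forecast that simply reproduces the climatological decision returns 0, matching the interpretation given above.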

In the context of the C/L scenario where the goal is to minimize expected 
expense, optimal value is attained by a customer who chooses their decision rule or 
decision threshold to match their C/L. This fact can be demonstrated as follows. For 
many (M) instances in which the forecast probability ($p_e$) takes a specific value, a user 
would either always protect or never protect based on their decision rule. For the two 
cases, the total expense (E) can be expressed respectively as: 

$$E_{\text{Protect}} = M \cdot C \qquad (2)$$

$$E_{\text{No Protect}} = M \cdot p_e \cdot L \qquad (3)$$

The user’s decision should then be to protect when: 

$$E_{\text{Protect}} < E_{\text{No Protect}}$$

$$M \cdot C < M \cdot p_e \cdot L$$

$$p_e > \frac{C}{L} \qquad (4)$$

Alternatively, the user’s decision rule calls for taking no action when: 

$$E_{\text{No Protect}} < E_{\text{Protect}}$$

$$p_e < \frac{C}{L} \qquad (5)$$



As shown above, the user’s optimal decision threshold is their C/L, prompting them to 
take protective action when $p_e$ is greater than C/L and to take no protective action when 
$p_e$ is less than C/L (Jolliffe and Stephenson 2003). Using this information, the optimal 
VS obtainable by all customers can be determined. The curve in Figure 3 is 
the optimal VS created using data from the low-order model employed in this research 
(Chapter III.A). The VS for each C/L is calculated based on a unique contingency table 
(e.g., Table 2) for each user built using the C/L in question as the decision rule. 
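
The argument that C/L is the optimal threshold can be checked numerically. The sketch below is an illustration under stated assumptions, not the procedure used to build Figure 3: it assumes perfectly reliable forecast probabilities on a regular grid and hypothetical values C = 3 and L = 10, then searches for the threshold with the lowest mean expected expense.

```python
def expected_expense(threshold, probs, cost, loss):
    """Mean expected expense when protecting whenever p > threshold,
    assuming every forecast probability p is perfectly reliable."""
    total = 0.0
    for p in probs:
        # Protecting costs C; not protecting costs L with probability p.
        total += cost if p > threshold else p * loss
    return total / len(probs)

probs = [(i + 0.5) / 100 for i in range(100)]   # 0.005, 0.015, ..., 0.995
cost, loss = 3.0, 10.0                          # hypothetical values, C/L = 0.3
thresholds = [i / 100 for i in range(101)]      # candidate decision thresholds
best = min(thresholds, key=lambda t: expected_expense(t, probs, cost, loss))
print("threshold with lowest expected expense:", best)
```

Under these assumptions the search selects the threshold equal to C/L, in line with Equations (4) and (5).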

Ambiguity, or uncertainty in the forecast probability, adds another dimension to 
the decision making process, resulting in three possibilities given a user’s C/L: 

• The entire ambiguity distribution is below the C/L (i.e., optimal 
decision threshold) so the user is convinced in their decision not to protect. 

• The entire ambiguity distribution is above the C/L so the user is convinced 
in their decision to protect. 

• The ambiguity distribution overlaps the C/L. In this case, the appropriate 
decision is unclear to the user. 

The term overlap is used here to refer to the total proportion of the ambiguity distribution 
that crosses the C/L in the direction opposing the decision based on the best-guess of the 
current risk, i.e., the chance of making the wrong decision. The ensemble forecast 
probability is taken as the best-guess risk, or alternately the best-guess forecast 
probability, since it represents the likelihood of the verifying observation crossing the 
event threshold resulting in a negative consequence. In Figure 4, the forecast probability 
indicates the user should protect. The ambiguity distribution overlap (hatched) in the 
figure describes the probability that the actual forecast probability is less than the C/L and 
the user should not protect. 
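
The overlap defined above can be estimated directly from a sample of the ambiguity distribution. The following Python sketch is illustrative only; the function name and the sample values are assumptions, and the ambiguity sample is taken as given rather than produced by any of the estimation methods in this research.

```python
def ambiguity_overlap(samples, p_best, cl_ratio):
    """Fraction of ambiguity-distribution samples that fall on the opposite
    side of C/L from the best-guess forecast probability p_best."""
    if p_best > cl_ratio:
        # Best guess says protect; opposing samples fall below C/L.
        opposing = [p for p in samples if p < cl_ratio]
    else:
        # Best guess says do not protect; opposing samples fall above C/L.
        opposing = [p for p in samples if p > cl_ratio]
    return len(opposing) / len(samples)

samples = [0.20, 0.25, 0.35, 0.40, 0.45]   # hypothetical ambiguity sample
print(ambiguity_overlap(samples, p_best=0.35, cl_ratio=0.30))
```

An overlap of zero corresponds to the first two bullets above (the decision is clear), while any positive value corresponds to the third.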

In the C/L scenario, long-term expense may still be minimized by using the best 
estimate of risk even if ambiguity is present and ignored. The ensemble’s estimate of risk 
(i.e., the forecast probability) is simply a random draw from a distribution of many 
possible forecast probability values (i.e., the ambiguity distribution). Given situations 
where overlap exists over many forecast cases, the best-guess risk will result in both 
positive and negative consequences, with the expectation that the forecast probability is 
truly a random draw from the ambiguity distribution and the selection process is not 
biased towards either of the consequence categories. Thus, the optimal user, when 
comparing the ensemble forecast probability alone as a measure of risk against the C/L 
(i.e., the optimal decision threshold), implicitly includes cases where overlap exists. 

To this point, we have not addressed the question of value added via knowledge 
of ambiguity. This research introduces two approaches for attempting to add value to the 
decision making process in situations where the decision input is unclear (i.e., overlap 
exists) using objective estimates of the ambiguity associated with the ensemble forecast. 

1. Uncertainty-folding 

The first approach to gain value from ambiguity information, called 
uncertainty-folding, combines the (first-order) uncertainty and ambiguity information to once again 
give the user a single probabilistic decision input based on the weather information. 
Given a sample of possible true forecast probability values ($p_j$) (i.e., the ambiguity 
distribution or second-order uncertainty) estimated using some objective method, each 
$p_j$ value is binned using a class interval of 1% over the range 0% to 100%. The relative 
frequency associated with each bin, $r(\delta)$, within the sample is determined. Note that 
each $\delta \in \{0.005, ..., 0.995\}$ (i.e., each bin’s center value) is a possible true forecast probability 
value, and therefore represents a possible value of risk (i.e., first-order uncertainty). Each 
$\delta$ value is multiplied by its respective relative frequency then summed to produce a 
single estimation of the forecast probability ($p_a$) that includes the ambiguity information. 

$$p_a = \sum_{\delta} \delta \, r(\delta) \qquad (6)$$

An example of this process is described in Figure 5. 
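
The binning and summation in Equation (6) can be sketched in a few lines of Python. This is a minimal illustration, not the original implementation; the function name and the three-value sample are assumptions, and values exactly equal to 100% are placed in the last bin by choice.

```python
def uncertainty_fold(samples, n_bins=100):
    """Bin ambiguity samples at 1% intervals and return the sum over bins of
    (bin center) * (relative frequency)."""
    counts = [0] * n_bins
    for p in samples:
        idx = min(int(p * n_bins), n_bins - 1)   # p = 1.0 goes in the last bin
        counts[idx] += 1
    total = len(samples)
    folded = 0.0
    for idx, count in enumerate(counts):
        delta = (idx + 0.5) / n_bins             # bin center: 0.005, ..., 0.995
        folded += delta * count / total          # delta * r(delta)
    return folded

print(uncertainty_fold([0.441, 0.448, 0.552]))   # hypothetical ambiguity sample
```

Because each bin is represented by its center, the result is a discretized mean of the ambiguity sample, which is why the folded value tends toward the expected value of the ambiguity distribution.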

As value studies in this research are focused on the C/L scenario, it is important to 
address whether or not the C/L is the optimal decision rule to minimize expense when 
using $p_a$. Samples from the ambiguity distribution (i.e., estimates of the true forecast 
probability, $p_j$) are all equally plausible realizations of the forecast probability for an 
event given the EPS’s sensitivity to the deficient simulation of uncertainty in the IC and 
model perturbations. As discussed, for any reliable ensemble forecast probability, the 
C/L is the optimal decision rule to minimize long-term expense, but while the forecast 
probability may be reliable on average, random error and ambiguity still exist for 
individual forecast cases. Thus, using the C/L with any random $p_j$ value taken from the 
ambiguity distribution over many cases will minimize expected expense in much the 
same way as using a single ensemble forecast. The $p_a$ value computed using 
uncertainty-folding is merely a combination of information from all of the $p_j$ values and 
is therefore simply another plausible realization of the true forecast probability. In 
practice, this theory depends on obtaining an accurate objective estimate of the ambiguity 
distribution, which is discussed in the results section. 

The control ensemble’s calibrated forecast probability ($p_e$) is a random sample 
taken from the ambiguity distribution. On the other hand, uncertainty-folding will 
produce a $p_a$ value close to the expected value of the ambiguity distribution. Thus the 
difference between $p_e$ and $p_a$ may be large enough to result in different decision inputs, 
i.e., they fall on opposite sides of the C/L. Over the long-term, $p_a$ should provide the 
best risk estimate and minimize expense by minimizing the error between the estimated 
risk used to make the decision and the true risk. 

2. Secondary Decision Criteria 

Dealing only with the economic value of information (i.e., the C/L scenario) 
neglects factors hard to quantify in terms of dollars that can also bring important 
consequences (e.g., loss of life, customer confidence, morale, mission effectiveness). 
Wallsten (1990) stated that ambiguity information was especially suited to decisions with 
multiple criteria. Thus, if the weather input to the decision is ambiguous, the user may be 
justified to take other factors into account to make the decision. This idea is used for the 
second approach to determine the value of ambiguity information. The simple C/L model 
is still applied, but when the ambiguity distribution overlaps the decision threshold 
(decision is unclear), the user may consider other (non-monetary) decision criteria to 
reverse the decision that would be made based purely on the best-guess risk. The option 
to include these secondary criteria comes with several questions: 

• How much overlap of the ambiguity distribution across the optimal 
decision threshold (C/L) is necessary before the user should consider 
secondary decision criteria? 

• How does the decision-maker decide whether or not to change their 
decision? 

• How can we measure the improvement in secondary consequences? 

Using this approach, the idea is not to increase the primary economic value 
(represented by the VS), but rather to add value to the user by improving consequences in 
terms of their secondary concerns. The goal is to add value to the secondary criteria 
without significantly decreasing the primary value achieved using the first-order criteria. 

As an example of a secondary criterion, consider a user who cannot tolerate repeat 
false alarms (i.e., the event is forecast to occur but does not occur). An example may be a 
base commander who previously evacuated aircraft and personnel when a typhoon was 
forecast to strike the base, but the typhoon track changed and it missed the base. The 
commander’s decision, although justified by risk analysis, resulted in degradation of 
mission effectiveness and unnecessary expense. As the next typhoon approaches, the 
commander desperately wishes to avoid another unnecessary evacuation. If the 
commander is given a risk clearly exceeding his C/L, he should again evacuate. But, 
given ambiguity and overlap, he may choose to stay put. 

Using an estimate of the ambiguity, it may be possible to reduce the likelihood of 
repeat false alarms by going against the decision based on the best-guess forecast 
probability, while not significantly changing the VS based on minimizing total expense. 
The idea is to reshuffle the outcomes to break up repeated false alarms, while keeping VS 
nearly constant. Changes to the decision based on including secondary decision criteria 
result in a different contingency table as compared to basing decisions only on the 
primary decision criteria (control $p_e$) (Table 3). In order to prevent changes in the 
primary value, the secondary criteria decision rule must produce changes that preserve 
the overall balance between positive and negative consequences, while not biasing 
towards one extreme. The user essentially trades the expense associated with a number 
of false alarms for the expense of a few extra misses as far as negative consequences are 
concerned. Value is then measured as a significant decrease in the number of repeat false 
alarms for a user who employs the ambiguity information compared to a user who bases 
decisions solely on the best-guess forecast probability. Value is not gained if reducing 
the number of repeat false alarms results in a significant decrease in the primary value 
(i.e., increase in expenses). 
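
To make the repeat false alarm scenario concrete, the following Python sketch counts repeat false alarms and applies one possible reversal rule. Both definitions are assumptions made for illustration and are not taken from this research: a repeat false alarm is counted here as a false alarm that immediately follows another false alarm, and the hypothetical rule reverses a decision to protect only when the previous case was a false alarm and the overlap is at least `min_overlap`.

```python
def count_repeat_false_alarms(protect, observed):
    """Count false alarms that immediately follow another false alarm."""
    repeats = 0
    previous_false_alarm = False
    for did_protect, event in zip(protect, observed):
        false_alarm = did_protect and not event
        if false_alarm and previous_false_alarm:
            repeats += 1
        previous_false_alarm = false_alarm
    return repeats

def apply_secondary_rule(p_best, overlap, observed, cl_ratio, min_overlap):
    """Hypothetical rule: after a false alarm, reverse a 'protect' decision
    when the ambiguity overlap is at least min_overlap."""
    decisions = []
    previous_false_alarm = False
    for p, ov, event in zip(p_best, overlap, observed):
        protect = p > cl_ratio                       # primary (C/L) decision
        if protect and previous_false_alarm and ov >= min_overlap:
            protect = False                          # secondary-criteria reversal
        decisions.append(protect)
        previous_false_alarm = protect and not event
    return decisions

p_best = [0.40, 0.35, 0.60, 0.10]       # hypothetical best-guess probabilities
overlap = [0.10, 0.40, 0.05, 0.00]      # hypothetical overlap values
observed = [False, False, False, True]
baseline = [p > 0.3 for p in p_best]
revised = apply_secondary_rule(p_best, overlap, observed, 0.3, 0.3)
print(count_repeat_false_alarms(baseline, observed),
      count_repeat_false_alarms(revised, observed))
```

In a full evaluation the VS of the revised decisions would also be compared with the baseline, since value is gained only if the primary value is not significantly reduced.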

It is important to stress that this scenario is just one example of using secondary 
criteria to add value when the decision input is unclear. There are many possible criteria 
that can be explored, where the criteria are user or context dependent. 

There has been a great deal of effort put into designing EPSs to efficiently sample 
the uncertainty associated with an NWP modeling system, but the EPSs still have 
limitations that result in random error in the uncertainty estimates (i.e., ambiguity). The 
purpose of this research was not to explore EPS design, but rather to investigate methods 
for objectively estimating the ambiguity associated with an EPS and to understand how 
EPS deficiencies influence the magnitude of ambiguity. Additionally, while most 
research has focused on the decision maker’s attitude towards ambiguity in the decision, 
we apply objective ambiguity estimates in the decision making process in an effort to add 
value compared to a user who simply uses the ensemble’s uncertainty estimate. Thus, 
this research will attempt to show: (1) it is possible to produce reasonably accurate 
objective estimates of the ambiguity associated with an EPS and (2) the ambiguity 
information can add value to the decision making process. 

Figure 1. Three simulated attempts to represent the forecast PDF using an eight-member 
“perfect model” ensemble. The forecast PDF (solid) being sampled is N(0,1), 
while the realized ensemble PDF (dashed) is normal with parameter values 
calculated based on random ensemble members: (a) mean and variance close to 
true values, (b) negatively biased mean and variance too small, (c) mean close to 
true and variance too large. Vertical lines represent the location of ensemble 
members. [Horizontal axis: Standard Deviation (σ)] 

Figure 2. Sampling distributions of the (a) standardized error in ensemble mean and (b) 
fractional error in ensemble spread (standardized ensemble standard deviation), 
dependent on the number of ensemble members. Results are shown for ensemble 
sizes of 10, 20, 40 and 80 members (labeled) [From Eckel and Allen 2009]. 

Figure 3. Optimal value score across the range of C/L values. The value score for each C/L 
is calculated using the C/L as the decision threshold. The climatological rate of 
occurrence ($\bar{o}$) is 29.5%. 


Figure 4. Ambiguity distribution overlap in the C/L scenario. The hatched area represents 
the overlap of the ambiguity distribution beyond the C/L (blue line), which would 
result in a different decision than that found using the best-guess or control 
forecast probability (red line). 

Figure 5. Histogram of possible first- and second-order uncertainty associated with some 
event used for calculating the uncertainty-folding forecast probability estimate 
($p_a$). As an example, the bin of forecast probability values from 44% to 45% 
(arrow) has a relative frequency of 5%, thus contributing 44.5% x 5% = 2.23% to 
the summation in Equation (6). 





Table 1. Contingency table used to tally the number of consequences associated with a 
forecast-observation dataset. A hit (a) is tallied when the weather event is 
forecast to occur and the event does occur. When the event is forecast to 
occur and is not observed, the resulting consequence is a false alarm (b). 
Alternately, when a weather event is not forecast to occur but is observed, the 
consequence is a miss (c). Lastly, a correct rejection (d) is counted when the 
weather event is not forecast to occur and the event is not observed. 

                                     Weather Event Observed 
                                     Yes                  No 
Event Forecast and/or        Yes     a (# of hits)        b (# of false alarms) 
Decide to Prepare            No      c (# of misses)      d (# of correct rejections) 


Table 2. Contingency table of consequences measured as the expense (E) associated with 
each forecast-observation pair within the C/L framework. C is the cost of taking 
protective action to mitigate the loss (L) if the event occurs. 

                                     Weather Event Observed 
                                     Yes                  No 
Event Forecast and/or        Yes     E = C                E = C 
Decide to Prepare            No      E = L                E = 0 

Table 3. Contingency table of possible changes in the repeat false alarm secondary 
decision criteria scenario. The change shown by the solid circle results in a 
positive consequence (correct rejection), while the change shown by the dotted 
circle results in a negative consequence (miss). 

                                     Weather Event Observed 
                                     Yes                  No 
Event Forecast and/or        Yes     a                    b 
Decide to Prepare            No      c                    d 

III. METHODOLOGY 


This chapter describes the methods used during this research to accomplish the 
stated research goals. Specifically, Section A gives a detailed look at the design of the 
EPS and the low-order model it was based on. Section B provides an overview of data 
postprocessing. Section C provides a description of the ambiguity estimation techniques, 
while Section D covers the validation of the techniques. The final section discusses the 
processes and scenarios used for determining the value of the ambiguity information. 
The primary programming platform used during this research was Matlab version 7.0 or 
later. 

A. L96 ENSEMBLE PREDICTION SYSTEM 

1. L96 Model Design 

In order to fully study the ambiguity associated with EF, it is necessary to have 
access to an EPS, a large forecast dataset, and suitable observation information. As a 
portion of this research will involve running multiple parallel EPS forecasts, using an 
EPS of an atmospheric model is impractical due to computational and storage limitations. 
Therefore, we use an EPS of a simpler dynamical system model to mimic an 
operational EPS. For this research, we chose the low-order, chaotic model first 
introduced by Lorenz (1996) as a suitable proxy for atmospheric NWP models. 

The model, hereafter L96, includes a set of symmetric, coupled equations 
describing the evolution of variables on two distinct time scales (Lorenz 1996; Wilks 
2005). 

$$\frac{dX_k}{dt} = -X_{k-1}\left(X_{k-2} - X_{k+1}\right) - X_k + F - \frac{hc}{b}\sum_{j=J(k-1)+1}^{kJ} Y_j; \quad k = 1 \ldots K \qquad (7)$$

$$\frac{dY_j}{dt} = -cb\,Y_{j+1}\left(Y_{j+2} - Y_{j-1}\right) - cY_j + \frac{hc}{b}X_{\mathrm{int}[(j-1)/J]+1}; \quad j = 1 \ldots JK \qquad (8)$$




The model emulates atmospheric processes in that the linear and forcing (F) terms 
provide internal dissipation and external forcing, and the quadratic terms simulate 
advection (Lorenz 1996). Results garnered from experiments using the L96 model can 
therefore reasonably be assumed to apply to atmospheric modeling systems. To further 
ensure the validity of this research, we designed the L96 EPS to operate using 
state-of-the-art methods for data assimilation, ensemble perturbations, and numerical techniques. 

The $X_k$ variables in the L96 model can be thought of as describing large-scale, 
slow moving processes, and the $Y_j$ variables thought of as small-scale, fast moving 
processes, where energy is transferred between the two scales of motion (Lorenz 1996; 
Wilks 2005). A possible physical explanation of the modeled process would be to 
consider the $Y_j$ variables as representing convective-scale values while the $X_k$ variables 
represent large-scale static instability (Lorenz 1996). Described another way, the $X_k$ 
variables are resolved on the model grid (latitude circle), while the $Y_j$ variables are 
unresolved or subgrid scale variables (Wilks 2005). 

The basic setup of the L96 model for this research follows Wilks (2005), with 
some modifications. After Wilks, K = 8 and J = 32, which corresponds to eight resolved 
variables and 256 unresolved variables (Figure 6). Scaling constants h, c, and b are taken 
as 1, 10, and 10, respectively, which ensures both scales are chaotic (Lorenz 1996; Wilks 
2005). To mimic operational atmospheric models, the unresolved variables are not 
modeled explicitly and must be parameterized in some fashion since they influence the 
evolution of the large-scale, resolved variables. We assume the physical laws governing 
the resolved variables are known completely, but the effects due to the unresolved 
variables are not precisely known and must be parameterized (Wilks 2005). In this 
configuration, the last term on the right side of Equation (7) is replaced by a 
parameterization term ($g_U$, described below) (Wilks 2005; Orrell 2003), 

$$\frac{dX_k}{dt} = -X_{k-1}\left(X_{k-2} - X_{k+1}\right) - X_k + F - g_U(X_k); \quad k = 1 \ldots K \qquad (9)$$

This experiment design gives us a model with random error (i.e., uncertainty) as 
well as the ability to be omniscient and know the true evolution of the system beginning 
from a known initial condition. The full, coupled L96 equations provide the “ground 
truth” or “true” state of the system. To provide the true trajectory, Equations (7) and (8), 
henceforth termed the L96 System (L96S), are integrated forward using the fourth-order 
Runge-Kutta (RK4) (Weisstein 2009) numerical scheme at a time step of 0.001 
(non-dimensional). In comparison, the parameterized L96 equations, Equation (9), henceforth 
the L96 Model (L96M), are integrated forward using the second-order Runge-Kutta 
(RK2) numerical scheme at a time step of 0.005. Forecasts derived using L96M exhibit 
model errors as a result of using an inferior numerical integration scheme (RK2 vs. RK4) 
and from parameterization of the unresolved scales. 
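
As a minimal sketch of the true-system integration, the Python code below evaluates the right-hand sides of Equations (7) and (8) and advances them with one RK4 step. It is not the Matlab code used in this research. The function names are illustrative, and the forcing value `F = 20.0` is a placeholder assumption because the forcing is not stated in this section. Only the L96S is sketched; the RK2 integration of the L96M is not shown.

```python
K, J = 8, 32                 # resolved variables; unresolved variables per resolved one
H, C, B = 1.0, 10.0, 10.0    # scaling constants h, c, b from the text
F = 20.0                     # placeholder forcing (assumption, not from this section)

def l96s_tendency(x, y):
    """Right-hand sides of Equations (7) and (8), with cyclic indices."""
    n = K * J
    dx = [-x[k - 1] * (x[k - 2] - x[(k + 1) % K]) - x[k] + F
          - (H * C / B) * sum(y[k * J:(k + 1) * J]) for k in range(K)]
    dy = [-C * B * y[(j + 1) % n] * (y[(j + 2) % n] - y[j - 1])
          - C * y[j] + (H * C / B) * x[j // J] for j in range(n)]
    return dx, dy

def _shift(state, tendency, scale):
    """Return state + scale * tendency, element by element."""
    return [s + scale * t for s, t in zip(state, tendency)]

def rk4_step(x, y, dt=0.001):
    """Advance the coupled system one step with classical fourth-order Runge-Kutta."""
    k1x, k1y = l96s_tendency(x, y)
    k2x, k2y = l96s_tendency(_shift(x, k1x, dt / 2), _shift(y, k1y, dt / 2))
    k3x, k3y = l96s_tendency(_shift(x, k2x, dt / 2), _shift(y, k2y, dt / 2))
    k4x, k4y = l96s_tendency(_shift(x, k3x, dt), _shift(y, k3y, dt))
    x_new = [xi + dt / 6 * (p + 2 * q + 2 * r + s)
             for xi, p, q, r, s in zip(x, k1x, k2x, k3x, k4x)]
    y_new = [yi + dt / 6 * (p + 2 * q + 2 * r + s)
             for yi, p, q, r, s in zip(y, k1y, k2y, k3y, k4y)]
    return x_new, y_new

x0, y0 = [F] * K, [0.0] * (K * J)    # arbitrary starting state for illustration
x1, y1 = rk4_step(x0, y0)
print(len(x1), len(y1))
```

Pure Python lists keep the sketch dependency-free; an array library would be the natural choice for the long integrations described in this chapter.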

The parameterization scheme in L96M is stochastic and based on the unresolved 
tendencies found between integrations of the L96S at time steps equivalent to the L96M 
(Wilks 2005). Developing the parameterization involves first integrating the L96S 
forward over some long trajectory with time step of 0.001, while storing all data. Then, 
for each time step over this long trajectory, the current resolved variable’s value ($X_k(t)$) 
and the value at a time equivalent to one L96M time step ($X_k(t + \Delta t)$, $\Delta t = 0.005$) are 
found in the L96S data. The unresolved tendencies $U(t)$ are then calculated as: 

$$U(t) = \underbrace{\left[-X_{k-1}(t)\left(X_{k-2}(t) - X_{k+1}(t)\right) - X_k(t) + F\right]}_{A} - \underbrace{\left[\frac{X_k(t + \Delta t) - X_k(t)}{\Delta t}\right]}_{B} \qquad (10)$$

Term A represents the model tendency of $X_k$ over $\Delta t$, while term B gives the true 
tendency of $X_k$ over $\Delta t$. 
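
Equation (10) can be sketched for a single resolved variable as follows. This Python function is an illustration under assumptions: the name and argument layout are hypothetical, the index `k` is 0-based, and the two states are assumed to have been read from stored L96S output one L96M time step apart.

```python
def unresolved_tendency(x_now, x_next, k, forcing, dt=0.005):
    """Unresolved tendency A - B for resolved variable k (0-based index).

    x_now and x_next hold the resolved variables at t and t + dt."""
    n = len(x_now)
    # Term A: tendency from the resolved terms only (cyclic indexing).
    term_a = (-x_now[k - 1] * (x_now[k - 2] - x_now[(k + 1) % n])
              - x_now[k] + forcing)
    # Term B: tendency actually realized in the full system over dt.
    term_b = (x_next[k] - x_now[k]) / dt
    return term_a - term_b

x_now = [1.0] * 8                     # hypothetical resolved state at time t
x_next = [1.02] + [1.0] * 7           # hypothetical resolved state at t + dt
print(unresolved_tendency(x_now, x_next, k=0, forcing=10.0))
```

Repeating the calculation for every stored time step and every resolved variable would give the cloud of points to which a regression curve can be fit.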

The range of subsequent $X_k(t + \Delta t)$ values associated with each possible $X_k(t)$ value represents 
the unresolved tendency in $X_k$, or values that could be missed without explicitly 
modeling the unresolved $Y_j$ variables. The symmetry of the governing equations 
produces very similar U for each resolved variable, so we can combine all results. The 
unresolved tendencies in $X_k$ are shown in Figure 7 for all eight resolved variables. The 
data are fit with a fourth-order polynomial regression (solid line in Figure 7). The 
unresolved tendency depends “strongly and nonlinearly on the value of the resolved 
variable” (Wilks 2005). Thus, for each value that the resolved variable can take on, there 
is a distribution of unresolved tendency values, centered on the regression curve. The 
parameterization function ($g_U$) is thus given both a deterministic and a stochastic 
component: 

$$g_U(X_k) = b_0 + b_1 X_k + b_2 X_k^2 + b_3 X_k^3 + b_4 X_k^4 + q_k \qquad (11)$$

where $b_0 = 0.293$, $b_1 = 1.55$, $b_2 = -0.0201$, $b_3 = -0.0106$, and $b_4 = 0.000565$ are the 
regression coefficients. The deterministic component (the first five terms on the right 
hand side of Equation (11)) is the regression equation. The $q_k$ term on the right side 
represents the stochastic component that allows for parameter values off of the regression 
curve. 
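
A minimal Python sketch of Equation (11) is given below. It is illustrative only: the function name and arguments are assumptions, the coefficients are those listed above, and the white-noise term is drawn with a caller-supplied standard deviation. Any rescaling of the noise by the time step, and the way the noise enters the numerical integration, are not reproduced here.

```python
import random

B_COEF = (0.293, 1.55, -0.0201, -0.0106, 0.000565)   # b0..b4 from Equation (11)

def g_u(x_k, noise_std=0.0, rng=random):
    """Parameterized unresolved tendency: quartic regression plus white noise."""
    deterministic = sum(b * x_k ** i for i, b in enumerate(B_COEF))
    # q_k: zero-mean Gaussian white noise; omitted when noise_std is 0.
    q_k = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    return deterministic + q_k

print(g_u(5.0))                                       # deterministic part only
print(g_u(5.0, noise_std=1.2, rng=random.Random(1)))  # one stochastic realization
```

Setting `noise_std` to zero recovers a purely deterministic parameterization, which is convenient for comparing the two schemes.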

The simple stochastic term is white noise produced by a normal distribution with 
zero mean and standard deviation of 2.32, which is equal to the average standard 
deviation of unresolved tendencies across all possible $X_k$ values. We rescale the 
stochastic component following Hansen and Penland (2006), who found that combining 
stochastic components with deterministic differential equations requires scaling by the 
square root of the time step ($1/\sqrt{0.005} = 14.14$). Initial tests of the ensemble resulted in 
ensemble forecasts that were nearly perfectly dispersive. Since the EPS needed to mimic 
current operational ensembles that are typically under-dispersive (Wilks 2005; Buizza 
1997; Toth and Kalnay 1993; Hamill and Colucci 1997; Eckel 2008), we reduced the 
standard deviation of the white noise distribution to 1.2 to reach a suitable, albeit 
subjective, level of under-dispersion (Chapter III.A.4). By decreasing the range of white 
noise in the L96M, additional model error was created, which resulted in 
under-dispersion since it was not accounted for in the EPS. 

2. L96 Climatology 

This section provides an understanding of the climatology of the L96S and L96M 
and notes any differences. For comparison and to show the advantages of using the 
stochastic parameterization, we introduce a simple deterministic parameterization where 
the stochastic component ($q_k$) in Equation (11) is set to zero. Also discussed are the 
correlations between the resolved $X_k$ variables to motivate the decision to consider 
each as independent when determining the ensemble’s error characteristics and for use in 
the validation of the ambiguity estimation techniques. 

To determine the climatological statistics, we ran the L96S and L96M using both 
the deterministic and simple stochastic parameterization schemes over a period 
encompassing 5000 time units beginning from a random, transient-free initial state. For 
each of the model climatologies (deterministic and stochastic), all eight of the $X_k$ 
variables were stored. For the L96S climatology, both the $X_k$ and $Y_j$ variables were 
maintained to understand the climatology of unresolved variables as well. A summary 
dataset was created for each configuration by assuming independence of the eight 
resolved variables (discussed below) and combining them. Climatological statistics for 
each dataset are displayed in Table 4. Comparing the range of values the resolved 
variable trajectory visited, it is clear the stochastic parameterization provided a 
climatology closer to L96S. The probability density of possible $X_k$ values is shown in 
Figure 8. Both model configurations do a reasonable job of representing the distribution 
of resolved variables, but the range and shape of the stochastic distribution is superior. 
The mean and standard deviation of the resolved variables are significantly closer to the 
“true” system values for the stochastic parameterization as well. These results further 
strengthened the case for implementing the L96M as described. 

We may consider the location of the resolved variables in the L96M as grid points 
on a single latitude circle (Lorenz 1996; Wilks 2005), where the forecast values at a 
specific lead time are essentially values at K adjacent grid points. Thus the resolved 
variables are analogous to variables in an operational NWP model (e.g., 2m temperature 
or 10m wind speed). Verification of operational deterministic and ensemble forecasts for 
a variable such as 2m temperature over a certain domain can be accomplished by using a 
high quality model analysis. An “observation” is available at each model grid point, so 
verification takes place at each grid point. These individual point statistics are then 
combined for an overall assessment of the modeling or ensemble system. In order to 
combine data from the individual grid points, the data should be independent and 
uncorrelated. In many cases, these assumptions do not strictly hold, but the correlations 
may be weak. In practice, the grid point data is typically combined under the assumption 
of independence to increase the size of the forecast-observation dataset used for 
verification. 

Evaluating the error characteristics and skill associated with an EPS requires an
extensive set of forecasts and observations. Following Descamps and Talagrand (2007),
who found that “cross correlation between the $X_k$s is negligible” in the L96 system, we
chose to assume independence between the variables. Although our testing indicated
a pattern of moderate correlation between the variables, we proceeded under the
assumption of independence to increase the size of the verification and climatological
datasets. By making this choice, we may have underestimated the uncertainty associated
with our results, since correlated samples make the effective size of the datasets smaller
than their nominal size.

3. L96M EPS Design 

This section describes how L96M was incorporated into an EPS for this research.
As described earlier, the goal of an EPS is to effectively account for all sources of 
uncertainty in the modeling system (Chapter II.A). Thus, state-of-the-art techniques were 
used to account for analysis errors and model deficiencies. 

To generate a control analysis and a suite of ensemble ICs, the process begins
with a uniform random draw for each of the eight $X_k$ variables and each of the 256 $Y_j$
variables from their respective climatological ranges. Using the L96S, the random state
is integrated forward to converge upon an arbitrary state on the true system attractor.
This state is taken as the current true state of the system. The L96S is then integrated
forward from this state over the data assimilation period and the entire forecast period to
provide a true trajectory for the system. The ICs for the ensemble members are found
using an Ensemble Kalman Filter (EnKF) data assimilation scheme.

Data assimilation is the process by which imperfect information (i.e., observations
and model background) about the current state of a system is combined optimally to
produce an analysis of the current state that is more precise than the original information 
(Kalnay 2003; Reichle et al. 2002). The Kalman Filter (KF) (Kalman 1960; Cohn 1997) 
is “an approximation of Bayesian state estimation which assumes linearity of error 
growth and normality of error distributions” (Hamill 2006). The KF process is generally 
divided into two steps, an update step and a forecast step. 

The update step involves adjusting an estimated state (e.g., background) and 
associated error statistics to new observations to form a new analysis state and 
uncertainty estimate. In the forecast step, the new analysis and uncertainty estimate are 
propagated forward to the next observation time using the full nonlinear dynamical model 
and tangent linear model and adjoint, respectively. Ultimately, the traditional KF is 
computationally too expensive for practical use in atmospheric data assimilation due to 
the high dimensionality of atmospheric modeling systems. 

The EnKF process is a sequential data assimilation technique that uses an 
ensemble of perturbed forecasts to provide the statistical information needed to produce 
the new analysis (Evensen 1994, 1997; Burgers et al. 1998). The process is an approximation
of the traditional KF or extended KF where the background-error covariance is not 
explicitly propagated forward in time but is estimated using the variance of the ensemble 
of background states (Evensen 1997; Reichle et al. 2002). In addition to not needing the 
tangent linear model and adjoint for explicit prognosis of the forecast-error covariance, 
the EnKF does not require the assumptions of linear error growth and normality of error 
distributions (Hamill 2006; Kalnay 2003; Tippett et al. 2003; Reichle et al. 2002). 
Determination of the background-error covariance using the ensemble provides a
flow-dependent estimate of the background error, allowing the EnKF to more optimally update
the background to new observations (Whitaker and Hamill 2002; Hamill 2006).

Ensemble-based data assimilation can be put into two categories, deterministic
and stochastic. The basic difference depends on “whether or not random noise is applied
during the update step to simulate observation uncertainty” (Hamill 2006). The EnKF
used for this research is a stochastic data assimilation technique, in that it involves an
ensemble of parallel data assimilation cycles where each member of the ensemble is
updated using an observation set perturbed by white noise while still being a plausible
realization of the observed state of the system (Hamill 2006). The process used to
generate the set of perturbed observations is described below.

Burgers et al. (1998) showed that for EnKF analysis to work properly, the
observations must be considered random variables. Otherwise, the ensemble error
covariances (background and analysis) will be underestimated since using the same
observation to update each member results in spurious correlations. Underestimation of
the error covariances may lead to filter divergence (i.e., the analysis drifts away from
truth) as observations are underweighted in the update step (Burgers et al. 1998; Whitaker
and Hamill 2002). Using optimal DA, the analysis should typically be more precise than
the information used to create it. Several methods have been developed to account for
error covariance underestimation, such as covariance inflation and localization (e.g.,
Anderson and Anderson 1999; Anderson 2003; Houtekamer and Mitchell 1998).

Importantly, the underestimation problem is a function of the ensemble size used
during EnKF (Hamill 2006; Whitaker and Hamill 2002; Reichle et al. 2002). Burgers et
al. (1998) noted that using too small an ensemble resulted in large analysis errors, and
more benefit could be gained by using an optimal interpolation data assimilation scheme.
According to Kalnay (2003), research using a quasi-geostrophic model found that 25-50
members were enough to benefit from using EnKF, but Houtekamer and Mitchell (1998)
found ensembles on the order of 100 members were necessary. Due to the low
computational cost of implementing the EnKF with the L96M, an ensemble size of 500
members was used for this research. We tested the EnKF over many scenarios starting
from different locations in the model attractor to ensure filter divergence was not
occurring. The Euclidean distances of the best-guess analysis vector and the observation
vector from the true state vector averaged 0.3 and 1.05, respectively, so the analysis was
consistently closer to truth than the observations. Thus covariance
underestimation was not a problem, likely due to the large ensemble size chosen.


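The EnKF data assimilation process is presented here following notation used by
Hamill (2006) and is shown in Equation (12)(a)-(e). Let $\mathbf{X}^{b} = (\mathbf{x}_1^{b}, \ldots, \mathbf{x}_m^{b})$ describe an
ensemble of background state vectors ($\mathbf{x}_i^{b}$) with m members in which each member’s
data is a column vector covering all state variable values. Ensemble perturbations ($\mathbf{x}_i'^{b}$)
from the mean ($\bar{\mathbf{x}}^{b}$) are found in the matrix given by Equation (12)(a).

$$
\begin{aligned}
\mathbf{X}'^{b} &= (\mathbf{x}_1'^{b}, \ldots, \mathbf{x}_m'^{b}), \quad \mathbf{x}_i'^{b} = \mathbf{x}_i^{b} - \bar{\mathbf{x}}^{b}, \quad i = 1, \ldots, m & \text{(a)} \\
\hat{\mathbf{K}} &= \hat{\mathbf{P}}^{b}\mathbf{H}^{T}\left(\mathbf{H}\hat{\mathbf{P}}^{b}\mathbf{H}^{T} + \mathbf{R}\right)^{-1} & \text{(b)} \\
\hat{\mathbf{P}}^{b} &= \frac{1}{m-1}\mathbf{X}'^{b}\left(\mathbf{X}'^{b}\right)^{T} & \text{(c)} \\
\mathbf{x}_i^{a} &= \mathbf{x}_i^{b} + \hat{\mathbf{K}}\left(\mathbf{y}_i - \mathcal{H}(\mathbf{x}_i^{b})\right) & \text{(d)} \\
\mathbf{x}_i^{b} &= M(\mathbf{x}_i^{a}) & \text{(e)}
\end{aligned}
\tag{12}
$$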


The update process begins by calculating an estimated Kalman gain [$\hat{\mathbf{K}}$, Equation
(12)(b)], which gives the optimal weights for the update based on the observation- and
background-error covariances (Reichle et al. 2002). To calculate the Kalman gain, the
background-error covariance ($\hat{\mathbf{P}}^{b}$) from Equation (12)(c) must be estimated
diagnostically from the ensemble of background states using Equation (12)(a). The
over-hat (^) is used to denote that the covariance found is an estimate of the true error
covariance since the ensemble size is finite. The $\mathbf{H}$ term is the linear transformation
matrix used to interpolate model data to the observation locations and transform the
model state variables to match the observed variable. In this research, the observation
and model data are the same quantity and are collocated, thus $\mathbf{H}$ is the identity matrix. $\mathbf{R}$
is the observation-error covariance matrix describing the typical observation error. The
superscript $T$ denotes the transpose of a vector or matrix.

With the estimated Kalman gain, Equation (12)(d) is used to update each member
of the ensemble of background state vectors ($\mathbf{x}_i^{b}$) individually using random (stochastic)
realizations of the observation information ($\mathbf{y}_i$) to find the ensemble of analysis state
vectors ($\mathbf{x}_i^{a}$). The nonlinear transformation operator, $\mathcal{H}$, performs the same function as $\mathbf{H}$,
and was again simply the identity. Following the update step, each member of the
ensemble of analysis states is integrated forward in time using the full nonlinear model
(M) to the time of the next observation using Equation (12)(e). The process is repeated
when new observation data are available.
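The update step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code used in this research: the function name `enkf_update`, the synthetic truth and background ensemble, and the choice of an observation-error standard deviation of 0.2 (so that R = 0.2² I) are all hypothetical, and H is taken as the identity as in the text.

```python
import numpy as np

def enkf_update(Xb, y, R, rng):
    """Stochastic (perturbed-observation) EnKF update with H = identity.

    Xb : (k, m) background ensemble, one column per member
    y  : (k,)   observation vector
    R  : (k, k) observation-error covariance
    """
    k, m = Xb.shape
    Xp = Xb - Xb.mean(axis=1, keepdims=True)   # perturbations, Eq. (12)(a)
    Pb = Xp @ Xp.T / (m - 1)                   # sample covariance, Eq. (12)(c)
    K = Pb @ np.linalg.inv(Pb + R)             # Kalman gain with H = I, Eq. (12)(b)
    # One perturbed observation per member, drawn around y with covariance R.
    Y = y[:, None] + rng.multivariate_normal(np.zeros(k), R, size=m).T
    return Xb + K @ (Y - Xb)                   # member-wise update, Eq. (12)(d)

# Synthetic demonstration (assumed values, not L96S data).
rng = np.random.default_rng(0)
k, m = 8, 500
truth = rng.normal(0.0, 4.0, size=k)
R = 0.2**2 * np.eye(k)                         # assumed observation-error covariance
y = truth + rng.multivariate_normal(np.zeros(k), R)
Xb = truth[:, None] + 1.0 + rng.normal(0.0, 1.0, size=(k, m))  # biased, wide background
Xa = enkf_update(Xb, y, R, rng)
```

In this synthetic case the analysis ensemble is both tighter than the background ensemble and centered closer to the truth, which is the behavior the update step is meant to produce.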

For this research, the initial EnKF state estimate (eight $X_k$ variable values) used
to initialize the spin-up cycle (described below) was taken as an observation of the
current true state of the system. We created an observation by adding a random draw
from an $N(0, \sqrt{\mathbf{R}})$ distribution to the current true state taken from the L96S trajectory.
The standard deviation of the observation error was taken as 5% of the climatological
standard deviation in $X_k$, thus $\sqrt{\mathbf{R}}$ was a diagonal matrix with 0.2 at all locations on the
diagonal. We generated 500 additional perturbed observations by adding random draws
from $N(0, \sqrt{\mathbf{R}})$ to the original observation. These 500 perturbed observations were
used as the initial EnKF members. The same process was used to produce perturbed
observations of the current true state for all subsequent filter updates.

We ran the EnKF through a one-time spin-up cycle consisting of 1000 model time
steps (0.005 time units each) where perturbed observations were available to update the
filter every 10 steps, or one data assimilation cycle. Each data assimilation cycle is
approximately equal to receiving observations every six hours, according to Lorenz
(1996) who found that one time unit was approximately equal to five days. The spin-up
cycle is necessary to allow the EnKF to achieve dynamical stability, ensuring its mean is
close to the true state and its perturbations have evolved to more accurately estimate the
background error covariance. The final forecast states from each of the 500 EnKF
members following the spin-up cycle were updated to produce the EnKF analyses used
for the first ensemble forecast. Additionally, these 500 EnKF analyses were used as the
starting point for the next data assimilation cycle.
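As a quick check of the timing stated above, the conversion from model steps to approximate real time follows directly from the step size and the Lorenz (1996) scaling of one time unit to roughly five days:

$$
10 \ \text{steps} \times 0.005 \ \frac{\text{time units}}{\text{step}} = 0.05 \ \text{time units} \approx 0.05 \times 5 \ \text{days} = 0.25 \ \text{days} = 6 \ \text{h}
$$

$$
\frac{1000 \ \text{steps}}{10 \ \text{steps per cycle}} = 100 \ \text{cycles} \approx 100 \times 6 \ \text{h} = 25 \ \text{days of spin-up}
$$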

We chose to separate the ensemble forecast runs by 20 data assimilation cycles 
(i.e., 200 model time steps). The length of this separation period was chosen empirically 
to allow sufficient time for the trajectory to reach a different region of the model 
attractor, thus reducing correlation between ensemble forecasts and producing forecasts 
that span as much of the attractor as possible. We monitored the total vector difference 
between the starting mean analysis and the mean analysis found following each data 
assimilation cycle over a number of cycles for many different starting conditions. The 
vector difference generally increased over a period of 15-25 data assimilation cycles 
before starting to decrease, which led to our choosing 20 cycles as the separation period. 

Following each data assimilation cycle, we took the mean of the EnKF members 
as the best-guess analysis of the current state of the system. Burgers et al. (1998) 
explained that the EnKF mean is a state estimate minimizing the root mean square error 
of the forecast. The best-guess analysis provided the initial condition for the 
deterministic forecast. The 21 ensemble forecast members’ initial conditions were taken 
as uniform random draws (without repeats) from the 500 EnKF members, all of which 
are equally plausible (Hansen 2009). We chose a 21-member ensemble to coincide with 
NCEP’s Global Ensemble Forecast System (GEFS), which we used during our value
studies (Chapter III.F).
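The selection of initial conditions described above amounts to a mean and a draw without replacement. The sketch below assumes the 500 EnKF analyses are held in an array of shape (500, 8); the array `analyses` is filled with placeholder random numbers here, and the variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
analyses = rng.normal(size=(500, 8))      # placeholder for 500 EnKF analyses of 8 variables

# Best-guess analysis: the EnKF mean, used to start the deterministic forecast.
control_ic = analyses.mean(axis=0)

# 21 ensemble initial conditions: uniform random draw without repeats.
idx = rng.choice(analyses.shape[0], size=21, replace=False)
ensemble_ics = analyses[idx]
```

Setting `replace=False` is what guarantees that no EnKF member is used twice within a single ensemble forecast.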

Model deficiencies are simulated in the L96M EPS using the perturbed parameter
approach, which is applied through the simple stochastic parameterization. As described
previously, a perturbed parameter EPS uses a single NWP model (i.e., the L96M) where
parameter values within the model are perturbed for each ensemble member (Chapter
II.A.2.a). The stochastic parameterization randomly varies the parameter value for each
member at every time step while maintaining the deterministic component of the
parameterization as the average parameter value.

We also tested a multi-model configuration of the L96M EPS. The parameter
coefficients were designed to be static for each ensemble member, analogous to the
deterministic portion of the stochastic parameterization (red line in Figure 7). First, we
binned the unresolved tendency values (blue dots in Figure 7) across all $X_k$ values using
a class interval of 0.5. To determine a single member’s coefficients, a uniform random
draw for each bin was taken from a range equaling four times the standard deviation of
unresolved tendency values within the bin, centered on the bin’s average unresolved
tendency value. We found each ensemble member’s parameter coefficients using a
fourth-order polynomial fit to the values drawn for each bin. The result was n static
deterministic parameterizations that are similar in nature but perturbed within the known
uncertainty of the parameter (Figure 9). Testing of the multi-model EPS consistency and
skill (not shown) showed mostly negligible differences between the perturbed parameter
EPS and the multi-model EPS configurations. One large difference was found when
comparing each EPS’s limit of predictability. The perturbed parameter EPS showed skill
through ten time units, while the multi-model EPS maintained skill through only eight
time units. We chose the perturbed parameter approach for this research since it was
previously shown to work well in the context of the L96 system (Wilks 2005).
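The coefficient-generation procedure for the multi-model configuration can be sketched as follows. This is an illustrative reading of the steps above, not the original implementation: the function name `member_coefficients`, the synthetic tendency data, and the decision to skip bins with fewer than two samples are assumptions made for the example.

```python
import numpy as np

def member_coefficients(x, u, rng, width=0.5, degree=4):
    """Draw one member's static polynomial coefficients.

    x : resolved-variable values; u : matching unresolved tendency values.
    """
    edges = np.arange(x.min(), x.max() + width, width)   # class interval of 0.5
    which = np.digitize(x, edges) - 1                    # 0-based bin index per sample
    centers, draws = [], []
    for b in range(len(edges) - 1):
        ub = u[which == b]
        if ub.size < 2:                                  # assumption: skip sparse bins
            continue
        mu, sd = ub.mean(), ub.std(ddof=1)
        centers.append(edges[b] + width / 2)
        # Range of four standard deviations centered on the bin mean.
        draws.append(rng.uniform(mu - 2 * sd, mu + 2 * sd))
    return np.polyfit(centers, draws, degree)            # fourth-order fit

# Synthetic stand-in for the Figure 7 scatter (not L96S data).
rng = np.random.default_rng(7)
x = rng.uniform(-5.0, 10.0, size=5000)
u = 0.3 + 1.2 * x - 0.01 * x**2 + rng.normal(0.0, 1.0, size=x.size)
members = np.array([member_coefficients(x, u, rng) for _ in range(21)])
```

Each row of `members` holds the five coefficients of one static parameterization, so the rows differ from one another only through the random draws within each bin.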

4. L96M EPS Performance 

Uncalibrated and calibrated forecast data are used to evaluate the consistency and
skill of the L96M EPS to ensure it behaves similarly to a real-world EPS. Consistency
(i.e., statistical consistency) is a measure of how well on average the ensemble forecast
PDF matches the true forecast PDF (Anderson 1996, 1997; Talagrand et al. 1997). We
evaluate consistency using the error variance diagram, dispersion diagram, verification
rank histograms (VRH), and the verification outlier percentage (VOP). The error
variance diagram is used to understand the predictability and benefit of using the
ensemble forecast by displaying the average error growth and comparing the limits of
predictability of the deterministic and ensemble forecasts (Eckel 2008). The dispersion
diagram directly compares the mean square error of the ensemble with the average
ensemble variance at each forecast lead time, where the ratio of these two values should
equal one for statistical consistency (Eckel 2008). This diagram is also useful in
diagnosing ensemble dispersion (i.e., rate of change in ensemble spread with increasing
time) and ensemble spread problems (under or over) (Eckel 2008). The VRH aids in
visualizing dispersion and consistency characteristics by tracking the location of the
verifying observation among the ranked ensemble members over many trials. Ideally,
the frequency of occurrence in each rank is equal. Hamill (2001) described interpretation
of various VRH shapes, but also demonstrated how EPS problems may be masked in the
VRH by interactions with other issues. We employ VOP as a measure of the ensemble’s
ability to portray truth by finding the percentage of verifications that fall outside three
standard deviations from the ensemble mean (Eckel 2008). VOP is calculated as:


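$$
VOP = \frac{1}{M}\sum_{k=1}^{M}
\begin{cases}
1, & V_k < \bar{e}_k - 3\sigma_k \ \text{or} \ V_k > \bar{e}_k + 3\sigma_k \\
0, & \text{otherwise}
\end{cases}
\tag{13}
$$

M is the total number of verifications, $\bar{e}_k$ and $\sigma_k$ are the mean and standard deviation
of the ensemble members for a single verification, respectively, and $V_k$ is a single
verification value. Lower VOP values indicate an ensemble PDF that more consistently
portrays the true state of the system.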
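A direct way to compute the VOP as defined in the text is shown below. The function name `vop` and the use of the sample standard deviation (`ddof=1`) are assumptions for this sketch; the result is returned as a fraction, which can be multiplied by 100 to express a percentage.

```python
import numpy as np

def vop(ens, verif):
    """Fraction of verifications outside mean +/- 3 standard deviations.

    ens   : (M, n) ensemble member values, one row per verification
    verif : (M,)   verifying values
    """
    mean = ens.mean(axis=1)
    std = ens.std(axis=1, ddof=1)          # assumption: sample standard deviation
    return np.mean(np.abs(verif - mean) > 3.0 * std)

# Two verifications against the same three-member ensemble (mean 1, std 1):
# the first (1.0) lies inside the 3-sigma envelope, the second (10.0) outside.
example = vop(np.array([[0.0, 1.0, 2.0], [0.0, 1.0, 2.0]]), np.array([1.0, 10.0]))
```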

The error variance diagram created from L96M forecast data (Figure 10) shows
that L96M accurately models the L96S climatology. Over a long forecast trajectory, the
deterministic forecast’s error variance should asymptote to twice the climatological
variance (Eckel 2008), which is seen in the figure. The deterministic limit of
predictability due to error growth is found at $\tau \approx 3.8$ ($\tau$ equals forecast time), where the
deterministic error variance increases above the climatological variance ($\sigma_c^2$). Once the
ensemble mean error variance reaches $\sigma_c^2$ ($\tau \approx 10.2$), the ensemble forecast has lost
predictability, and a forecast based on climatology is in order. The extension of the
ensemble mean error variance above $\sigma_c^2$ was not expected, but it is consistent with
results seen in Tribbia and Baumhefner (2004) using a real-world EPS to forecast 500-mb
height. Maximum ensemble dispersion is indicated between $\tau \approx 0.6$ and $\tau \approx 2.0$ by the
average variance between ensemble members, which is a measure of how the ensemble
members diverge with respect to one another. In a consistent EPS, this measure should
match the forecast error growth (i.e., rate of increase of deterministic forecast error
variance) (Eckel 2008). Thus, the L96M EPS is under-dispersive, but it was designed
this way on purpose to imitate the performance of a real-world EPS.

Dispersion diagrams are provided for both the uncalibrated (Figure 11) and
calibrated (Figure 12) ensemble forecast data (see Chapter III.B.1 for specifics on the
calibration technique). As stated, the dispersion diagram gives a direct look at the
consistency of the EPS by comparing the mean square error of the ensemble mean and the
average ensemble variance (Chapter III.B). As expected, the dispersion diagram for the
uncalibrated data indicates under-dispersion of the ensemble forecast on average. The
bulk calibration is able to correct for the dispersion deficiencies and give near-perfect
dispersion at all forecast lead times. VRH for various forecast lead times are provided for
the uncalibrated (Figure 13) and calibrated data (Figure 14) as well. In Figure 13, the
L96M EPS displays the characteristic U-shaped VRH of being under-spread, where more
verifications fall into the outer ranks than expected. The indication of a slight positive
bias (i.e., more verifications in the left-hand ranks) is also present. This positive bias is
seen in the L96M error statistics (Chapter III.B.3). Calibration is able to flatten out the
VRH (i.e., make the ranks more uniform) throughout the forecast period (Figure 14). The
remaining lack of uniformity seen in the calibrated VRH may be explained by the lack of
calibration on higher moments of the ensemble PDF.
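The U-shaped histogram discussed above is easy to reproduce with synthetic data. The sketch below is illustrative only: the function name `rank_histogram` is hypothetical, ties between members and verification are ignored (reasonable for continuous variables), and the under-spread ensemble is simulated with Gaussian draws rather than L96M output.

```python
import numpy as np

def rank_histogram(ens, verif):
    """Count how often the verification falls in each of the n + 1 ranks.

    ens   : (M, n) ensemble member values
    verif : (M,)   verifying observations
    """
    ranks = (ens < verif[:, None]).sum(axis=1)   # members below the verification: 0..n
    return np.bincount(ranks, minlength=ens.shape[1] + 1)

# Synthetic under-spread case: truth has unit spread, members only half of that.
rng = np.random.default_rng(0)
M, n = 20000, 21
verif = rng.normal(0.0, 1.0, size=M)
ens = rng.normal(0.0, 0.5, size=(M, n))
counts = rank_histogram(ens, verif)              # outer ranks dominate (U shape)
```

A statistically consistent ensemble would instead give roughly M / (n + 1) counts in every rank.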

The VOP values (Figure 15) indicate the calibrated ensemble forecast PDF does a 
better job representing the true forecast PDF compared to the uncalibrated data. Both 
datasets show low VOP values early in the forecast period, which rapidly increase as 
error growth increases. Since the calibrated data has a better handle on the dispersion on 
average, its VOP value does not grow to the extent of the uncalibrated data. Although the
calibrated data shows near perfect dispersion on average, the VOP does not reach the
perfect line in Figure 15 since dispersion is not perfect for all individual cases.

We now evaluate forecast skill using the entire forecast dataset to examine the
performance of forecast probabilities (see Chapter III.B.2 for specifics on calculating
forecast probability). In this research, two representative event thresholds were chosen to
be verified, one to represent a fairly common event and the other a rare event, based on
the climatology of the L96S (Figure 8(a), page 84). The threshold for the common event
was taken as X = 6.31, which is exceeded 30% of the time. The rare event threshold was
X = 9.98, which is exceeded only 10% of the time.
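Thresholds defined by an exceedance frequency are simply upper quantiles of the climatological sample: a value exceeded 30% of the time is the 70th percentile, and a value exceeded 10% of the time is the 90th percentile. The sketch below uses a placeholder Gaussian sample (its mean and spread are arbitrary assumptions), so it will not reproduce the 6.31 and 9.98 values, which come from the L96S climatology.

```python
import numpy as np

rng = np.random.default_rng(1)
clim = rng.normal(3.5, 5.0, size=200_000)   # placeholder climatology, not L96S data

# Exceeded 30% of the time -> 0.70 quantile; exceeded 10% of the time -> 0.90 quantile.
common_threshold, rare_threshold = np.quantile(clim, [0.70, 0.90])

common_freq = np.mean(clim > common_threshold)
rare_freq = np.mean(clim > rare_threshold)
```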

We verified probability forecasts using the Brier Skill Score (BSS) using sample
climatology as the reference forecast (Jolliffe and Stephenson 2003). BSS decomposition
provides a measure of the reliability and resolution of the ensemble forecasts for a given
event threshold. Taken over many verifications, reliability is a measure of how well
forecast probabilities match observed relative frequencies for the event in question
(Wilks 2006). For example, over many cases where the probability of occurrence is 20%,
we expect to observe (verify) that event 20% of the time. The resolution of the ensemble
forecasts provides a measure of how well the forecasts distinguish between events and
non-events (i.e., the sharpness of the forecast PDF) (Wilks 2006).

The BSS we employed is the decomposed form, which uses discrete, contiguous
bins of forecast-observation data pairs allowing calculation of the component reliability
and resolution values (Eckel 2008). To calculate the BSS, we must first define the Brier
Score (BS) (Wilks 2006):

$$
BS = \frac{1}{M}\sum_{i=1}^{I} N_i\left[(p_e)_i - \bar{o}_i\right]^2 - \frac{1}{M}\sum_{i=1}^{I} N_i\left(\bar{o}_i - \bar{o}\right)^2 + \bar{o}\left(1 - \bar{o}\right) \tag{14}
$$

M is the total number of forecast-observation pairs, I is the total number of bins, and $N_i$
is the number of forecast-observation pairs in the $i$th bin. Also, $(p_e)_i$ is the
representative forecast probability value for the $i$th bin (i.e., bin’s average value), $\bar{o}_i$ is
the observed relative frequency of the $i$th bin, and $\bar{o}$ is the sample climatology. The first
term on the right hand side of the equation is the reliability (rel) component of the BS,
while the second term is the resolution (res) component. The final term is a measure of
the uncertainty (unc) in the forecast of the event in question and is solely dependent on
the event climatology. BSS may then be computed by (Wilks 2006):


$$
BSS = \frac{res - rel}{unc} \tag{15}
$$
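The decomposition in Equations (14) and (15) can be sketched as below. This is a minimal illustration, not the verification code used in this research: the function name `brier_decomposition`, the choice of ten equal-width probability bins, and the toy inputs are assumptions.

```python
import numpy as np

def brier_decomposition(p, o, n_bins=10):
    """Return (rel, res, unc, BSS) from forecast probabilities p and outcomes o (0/1)."""
    p = np.asarray(p, dtype=float)
    o = np.asarray(o, dtype=float)
    M = p.size
    obar = o.mean()                                          # sample climatology
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)  # contiguous bins on [0, 1]
    rel = res = 0.0
    for i in range(n_bins):
        sel = bins == i
        N = sel.sum()
        if N == 0:
            continue
        rel += N * (p[sel].mean() - o[sel].mean()) ** 2      # reliability term
        res += N * (o[sel].mean() - obar) ** 2               # resolution term
    rel /= M
    res /= M
    unc = obar * (1.0 - obar)                                # uncertainty term
    return rel, res, unc, (res - rel) / unc

# Perfect forecasts give BSS = 1; always forecasting climatology gives BSS = 0.
perfect = brier_decomposition([0, 0, 1, 1], [0, 0, 1, 1])
climo = brier_decomposition([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1])
```

Because rel is negatively oriented and res positively oriented, a forecast system has skill relative to climatology only when its resolution exceeds its reliability penalty.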


For the common event, the BSS indicates forecast skill through $\tau = 9.6$ for the
uncalibrated data (Figure 16) and $\tau = 10.2$ for the calibrated data (Figure 17).
Calibration appears to have significantly improved the reliability of the ensemble
forecasts for this event throughout the forecast [Figure 18(a)], while a small and likely
insignificant improvement in resolution was also seen [not apparent in Figure 18(b)].
The combination of improvements provided a small gain in BSS scores throughout the
forecast, thereby extending the period over which the L96M EPS showed skill. For the
rare event, the BSS indicates forecast skill through $\tau = 7$ for both the uncalibrated (Figure
19) and calibrated (Figure 20) data. Calibration resulted in an improvement in reliability
through $\tau = 2.6$, but degraded reliability after that time [Figure 21(a)]. It should be noted
that scaling of the figure makes the decrease in reliability appear large, but changes are in
the thousandths decimal place. Although hard to discern in Figure 21(b), resolution is
actually improved by the calibration, which likely offset the decrease in skill due to
worse reliability.

Based on this analysis of the EPS performance, we have further confirmed the
ability of the L96M to simulate the L96S climatology and demonstrated the effectiveness
of the calibration technique used during this research. More importantly, we have shown
that the L96M EPS appears to behave like a real-world EPS. Additionally, we have seen
that the uncalibrated and calibrated forecasts for both the rare and common events have
skill out to approximately seven and ten time units, respectively. This feature plays a
crucial role when we consider the value of the ensemble forecasts and the ambiguity
information, as it would not make sense to assess the value of a modeling system
compared to climatology once the modeling system has lost skill compared to
climatology. Taking this analysis in conjunction with data processing constraints, we
only consider forecast lead times less than five time units for exploring the research
objectives.

B. POSTPROCESSING OF ENSEMBLE FORECAST DATA 

This section describes the postprocessing of the ensemble forecast data. We used
the L96M EPS to generate 3,000 ensemble forecasts each consisting of 8 resolved
variables, giving a total of 24,000 verifications available at each forecast lead time. The
forecasts were run out to five time units (non-dimensional), and postprocessing was
performed at a time increment of 0.2 time units, which totaled 51 lead times including the
analysis. For the purpose of determining the L96M EPS error characteristics and
calibration coefficients, the postprocessing described in the following subsections was
performed using all forecast data at each forecast lead time. Verifying observations were
taken directly from the L96S trajectory without adding error (based on the typical
observation error) even though erred observations are a source of random error. By
$\tau = 0.2$, the typical observation error was only a small fraction of the total error, thus the
observation error was inconsequential at later lead times.

1. Calibration 

We calibrated the L96M ensemble forecast data to correct for systematic errors.
Once the systematic error is removed, the remaining error is the random error associated
with the forecast, which is the primary cause of ambiguity. In this research, a simple
bulk calibration was performed to correct the average errors associated with the first and
second moments of the forecast PDF. We chose to use a bulk calibration technique
versus a more sophisticated technique to allow for a fair comparison during the estimate
validation. A calibration technique that introduces additional information (e.g.,
downscaling) may reduce ambiguity, thus applying a more sophisticated calibration to
one of the practical estimation techniques (discussed in Chapter III.C) without
performing the same calibration on the theoretical estimation method used as the
validation standard would bias the results. Calibration was performed at each forecast
lead time.

We used a simple shift-and-stretch calibration technique described by Eckel
(2008). The shift adjusts the first moment of the ensemble forecast PDF by correcting
each ensemble member individually by the negative of the mean error in the ensemble
mean defined as:

$$
ME_{\bar{e}} = \frac{1}{M}\sum_{k=1}^{M}\left(\bar{e}_k - y_k\right) \tag{16}
$$

M is the total number of verification points, $\bar{e}_k$ is the mean of a single ensemble forecast,
and $y_k$ is the observation. Using $ME_{\bar{e}}$, the shift calibration is performed as follows:

$$
\tilde{e}_i = e_i - ME_{\bar{e}}, \quad i = 1 \ldots n \tag{17}
$$

$\tilde{e}_i$ is a single shifted ensemble member, $e_i$ is a single uncorrected member, and n is the
number of ensemble members. This approach assumes the bias is the same for each
ensemble member, making it unacceptable for use with a multi-model EPS.

The second moment calibration or stretch is performed to increase (or decrease)
the spread (defined here as standard deviation) of the ensemble forecast PDF in
accordance with the fractional error in ensemble spread:

$$
FE_{\sigma} = \sqrt{\frac{\overline{\sigma_e^2}}{MSE_{\bar{\tilde{e}}}}} \tag{18}
$$

The numerator is the average ensemble variance, and the denominator is the mean square
error of the bias-corrected (shifted) ensemble mean, each respectively defined as:

$$
\overline{\sigma_e^2} = \frac{1}{M}\sum_{k=1}^{M}\left[\frac{1}{n-1}\sum_{j=1}^{n}\left(\tilde{e}_{k,j} - \bar{\tilde{e}}_k\right)^2\right] \tag{19}
$$

$$
MSE_{\bar{\tilde{e}}} = \frac{n}{n+1}\left[\frac{1}{M}\sum_{k=1}^{M}\left(\bar{\tilde{e}}_k - y_k\right)^2\right] \tag{20}
$$

n is the number of ensemble members, M is the total number of individual verifications,
$\bar{\tilde{e}}_k$ is the bias-corrected ensemble mean, $y_k$ is the observation, and $\tilde{e}_{k,j}$ is the
bias-corrected ensemble member (Eckel and Mass 2005; Eckel 2008). Eckel and Mass (2005)
showed the $n/(n+1)$ factor in Equation (20) is required for small ensemble sizes. The
stretch calibration is performed using the previously shifted ensemble members:

$$
e_i^{*} = \bar{\tilde{e}} + \frac{\tilde{e}_i - \bar{\tilde{e}}}{FE_{\sigma}}, \quad i = 1 \ldots n \tag{21}
$$

$e_i^{*}$ is the fully calibrated ensemble member.
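The shift-and-stretch procedure of Equations (16) through (21) can be sketched as follows. Two points are assumptions of this sketch rather than statements from the source: the fractional error is taken as the square root of the ratio of average ensemble variance to the mean square error of the shifted ensemble mean, and the stretch divides each member's deviation from its ensemble mean by that factor. The function name `shift_and_stretch` and the synthetic forecasts are hypothetical.

```python
import numpy as np

def shift_and_stretch(ens, obs):
    """Bulk first- and second-moment calibration.

    ens : (M, n) ensemble forecasts; obs : (M,) verifying observations.
    Returns (calibrated ensemble, mean error, fractional error in spread).
    """
    M, n = ens.shape
    me = np.mean(ens.mean(axis=1) - obs)                    # Eq. (16): bias of ensemble mean
    shifted = ens - me                                      # Eq. (17): shift every member
    mean = shifted.mean(axis=1, keepdims=True)
    avg_var = shifted.var(axis=1, ddof=1).mean()            # Eq. (19): average ensemble variance
    mse = n / (n + 1) * np.mean((mean[:, 0] - obs) ** 2)    # Eq. (20): MSE of shifted mean
    fe = np.sqrt(avg_var / mse)                             # Eq. (18): assumed square-root form
    calibrated = mean + (shifted - mean) / fe               # Eq. (21): assumed stretch form
    return calibrated, me, fe

# Synthetic biased, under-spread ensemble (not L96M output).
rng = np.random.default_rng(3)
M, n = 5000, 21
obs = rng.normal(0.0, 1.0, size=M)
mean_err = rng.normal(0.0, 1.0, size=M)                     # flow-dependent error of the mean
ens = obs[:, None] + 0.5 + mean_err[:, None] + 0.3 * rng.normal(size=(M, n))
cal, me, fe = shift_and_stretch(ens, obs)
```

With these definitions the calibrated data satisfy the consistency condition by construction: the average ensemble variance equals the (small-ensemble corrected) mean square error of the ensemble mean.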

2. Calculating Forecast Probability 

Unless otherwise noted, we based all forecast probability calculations during this
research on probability of exceedance of the event threshold. The results presented
would not change if the probability of precedence were used.

We calculated forecast probability values using the uniform ranks method (Hamill
and Colucci 1997). Uniform ranks assumes the output from each of the n ensemble
members for a variable at one grid point is equally likely, or that there is a uniform
probability distribution of the rank-ordered values. The total probability is then divided
into n + 1 bins, each with equal probability of containing the verifying observation.

The forecast probability is calculated as the sum of the rank-probability bins
greater than the event threshold, plus the partial probability of the bin containing the
event threshold. For an event threshold in bins 2 through n-1, the forecast probability
($p_e$) is calculated as:

$$
p_e = \Pr(\theta < y < e_i) + \Pr(y > e_i) = \left[\frac{e_i - \theta}{e_i - e_{i-1}}\right]\frac{1}{n+1} + \frac{n+1-i}{n+1} \tag{22}
$$

$\theta$ is the event threshold value, $y$ is the observation value, $e_i$ is the value of the ensemble
member with rank i, and n is the number of ensemble members (Eckel 2008). This
process is depicted in Figure 22.

If the event threshold falls in either rank 1 or rank n + 1, it is not possible to use
Equation (22) since no ensemble value is available to calculate the partial probability.
For example, if $\theta$ lies in rank n + 1, then $e_{i-1} = e_n$ is the largest ensemble value and no
$e_i = e_{n+1}$ ensemble member is available. In this case, the forecast probability is calculated
as:

$$
p_e = \Pr(y > \theta) = \left[\frac{1 - G(\theta)}{1 - G(e_n)}\right]\frac{1}{n+1} \tag{23}
$$

$G(\,)$ represents the Gumbel cumulative distribution function (CDF) (Wilks 2006) given by
equations:

$$
G(x) = \exp\left\{-\exp\left[-\frac{x - \xi}{\beta}\right]\right\}, \quad \beta = \frac{s\sqrt{6}}{\pi}, \quad \xi = \bar{x} - \gamma\beta \tag{24}
$$

The Gumbel CDF parameters are estimated using the sample mean ($\bar{x}$) and standard
deviation (s) of the ensemble members and $\gamma = 0.57721$ (Euler’s constant). If $\theta$ falls in
rank 1, the reverse Gumbel, $G'(\,)$, is used (Eckel 2008).

$$
p_e = \Pr(\theta < y < e_1) + \Pr(y > e_1) = \frac{1 - G'(\theta)/G'(e_1)}{n+1} + \frac{n}{n+1} \tag{25}
$$

The Gumbel distribution was chosen to represent the tails of the ensemble distribution
because of its ability to capture extreme events.
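The uniform ranks calculation with Gumbel tails can be sketched as below. This is an illustration under stated assumptions, not the original code: the function names are hypothetical, the tail fractions are normalized so that each outer rank carries probability 1/(n + 1), and the reverse Gumbel is implemented by mirroring the Gumbel distribution about the ensemble mean.

```python
import numpy as np

EULER_GAMMA = 0.57721

def gumbel_cdf(x, mean, std):
    """Gumbel CDF with moment-based parameter estimates."""
    beta = std * np.sqrt(6.0) / np.pi
    xi = mean - EULER_GAMMA * beta
    return np.exp(-np.exp(-(x - xi) / beta))

def reverse_gumbel_cdf(x, mean, std):
    """Assumed reverse Gumbel: the Gumbel mirrored about the mean (long tail to low values)."""
    return 1.0 - gumbel_cdf(2.0 * mean - x, mean, std)

def exceedance_probability(members, theta):
    """Probability that the verification exceeds theta under uniform ranks."""
    e = np.sort(np.asarray(members, dtype=float))
    n = e.size
    mean, std = e.mean(), e.std(ddof=1)
    if theta >= e[-1]:                      # threshold in rank n + 1 (upper tail)
        tail = (1.0 - gumbel_cdf(theta, mean, std)) / (1.0 - gumbel_cdf(e[-1], mean, std))
        return tail / (n + 1)
    if theta < e[0]:                        # threshold in rank 1 (lower tail)
        frac = reverse_gumbel_cdf(theta, mean, std) / reverse_gumbel_cdf(e[0], mean, std)
        return (1.0 - frac) / (n + 1) + n / (n + 1)
    i = np.searchsorted(e, theta, side="right")      # e[i-1] <= theta < e[i] (0-based)
    partial = (e[i] - theta) / (e[i] - e[i - 1])     # share of the bin above theta
    return (partial + (n - i)) / (n + 1)             # plus the n - i full bins above e[i]

p_mid = exceedance_probability([1.0, 2.0, 3.0, 4.0], 2.5)
p_high = exceedance_probability([1.0, 2.0, 3.0, 4.0], 10.0)
p_low = exceedance_probability([1.0, 2.0, 3.0, 4.0], -10.0)
```

For the four-member example, a threshold at the center of the ensemble gives a probability of one half, a threshold far above all members gives a small probability below 1/(n + 1), and a threshold far below all members gives a probability just under one.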

3. L96M EPS Error Characteristics 

Estimating ambiguity in the ensemble forecast requires knowledge of the error
characteristics of the EPS. For the 3,000 forecast cycles, each EPS forecast run consisted
of 21 members describing plausible realizations of the 8 resolved variables at each time
step. From previous discussion, the eight resolved variable forecasts are considered
independent forecasts and are combined to create a total dataset of 24,000 (3,000 x 8)
ensemble forecasts. The postprocessing procedure used for the ensemble forecast data is
depicted in Figure 23.

Using the large calibrated dataset, Equations (16), (18), (19), and (20) are applied
to diagnose the overall or bulk EPS error characteristics, giving the mean error of the
ensemble mean, fractional error in ensemble spread, and average ensemble variance at
each time step. As stated, uncertainty in the forecast probability is actually a function of
the error variance and not the bulk error. We therefore remove the bulk (i.e., systematic)
error using the shift-and-stretch calibration method to reveal the remaining random error
which contributes to ambiguity. The practical ambiguity estimation techniques
(described in Chapter III.C) use the error variance statistics (i.e., the random error) to
produce their ambiguity distributions.

To determine the variance associated with the relevant error statistics, it is
necessary to subset the large ensemble forecast dataset (Eckel and Allen 2009). For this
research, we chose to subset the large dataset of 24,000 verifications (per time step) into
3,000 sets of 8 forecasts, where each set is an individual EPS run. Each ensemble
forecast, consisting of 21 possible values of the eight variables, describes the uncertainty
about a unique trajectory within the model attractor. Errors associated with each
ensemble forecast PDF are sensitive to the flow or current location in the attractor. Thus,
sub-setting based on complete EPS runs ensures that flow-dependent error characteristics
from different locations around the model attractor are used to find the error variance
statistics. This sub-setting strategy also follows the analogy of relating the L96M EPS to
an operational EPS running once per day. If we assume each L96M EPS run is the same
as an operational EPS run, then we are essentially looking at 3,000 individual (one per
day) ensemble forecasts. The subset error statistics are thus equivalent to determining the
error on a daily basis, which is the same approach as that chosen by Eckel and Allen (2009).
Once the error statistics (ME_ē, σ', and the average ensemble variance) are calculated for
each of the 3,000 subsets using the same equations as above, the variance of the subset
values provides the variance of the error distributions about the bulk values. The L96M
EPS error statistics (bulk and variance) for each time step are provided in Table 5. The
error statistics indicate that the L96M EPS had a small positive bias that was consistent
throughout the forecast. The fractional error was less than one throughout the forecast,
but this was expected since the EPS was designed to be under-dispersive to mimic an
operational EPS.
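
The distinction between the bulk statistic and the variance of the subset statistics can be sketched as follows. This is only an illustration for the mean error of the ensemble mean; the data layout (one list of ensemble-mean errors per EPS run) and the function name are assumptions, and Equations (16) to (20) are not reproduced here.

```python
import statistics


def bulk_and_subset_variance(errors_by_run):
    """Return (bulk mean error, variance of the per-run mean errors).

    errors_by_run: one list per EPS run, holding the ensemble-mean error
    (forecast minus verification) for each resolved variable in that run.
    """
    all_errors = [err for run in errors_by_run for err in run]
    bulk_mean_error = statistics.fmean(all_errors)          # computed over the full dataset
    run_means = [statistics.fmean(run) for run in errors_by_run]
    return bulk_mean_error, statistics.variance(run_means)  # spread about the bulk value


# Three hypothetical runs of two variables each
bulk, subset_var = bulk_and_subset_variance([[1.0, 3.0], [2.0, 4.0], [3.0, 5.0]])
```

In the setting described above there would be 3,000 runs of 8 variables each, and the same pattern would be repeated for the fractional error in spread and the average ensemble variance.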

C. ESTIMATING AMBIGUITY

This section describes the three ambiguity estimation techniques used in this
research. The first estimation method is a fundamental approach that would
produce the true distribution of forecast probabilities given unlimited sampling. The
remaining two techniques estimate the ambiguity based on the error characteristics of the
EPS and are practical approaches to estimating ambiguity. For convenience, the explanation
of the practical methods follows that given in Eckel and Allen (2009), where
real-world data from the Japan Meteorological Agency EPS (hereafter JM) was used
over the same domain and forecast period described in Chapter III.E.

1. Ensemble-of-Ensemble

The ensemble-of-ensemble (EoE) method is the theoretical and impractical
approach to estimating the ambiguity associated with an operational EPS. The calibrated
forecast probability (p_e*) from a non-ideal EPS can be considered a single sample from a
distribution of true forecast probabilities (p_T), since the ensemble forecast PDF is a
single realization of many plausible forecast PDFs, given the limited sampling and
unaccounted-for uncertainty in the EPS. The EoE approach builds an estimate of the p_T
PDF by running N parallel EPSs (termed constituents), each with unique ICs and
unique model perturbations, all of which are similar in nature to the original, control
EPS. The result is N equally probable forecast PDF realizations for any single forecast
timeframe, giving N unique p_T samples (i.e., estimates of the true forecast probability).
The distribution of p_T reflects the uncertainty in the forecast probability (i.e., the
ambiguity) in the forecast. This approach is unrealistic for operational use
given the large computational expense of running multiple, parallel ensemble forecasts;
if the computational resources were available, they would be better spent improving
the EPS through additional members and/or higher resolution.

To produce the EoE ambiguity distribution, we ran multiple parallel ensemble
forecasts (i.e., different versions of the L96M EPS) from the same control analysis state
while allowing initial condition perturbations and model parameterization values to vary.
Each constituent gave a different yet equally plausible set of n members as well as a
different forecast probability of occurrence for some event threshold. Taken all together,
the constituents provided a distribution of forecast probability values (i.e., an estimate
of the p_T PDF) for a given event that was flow dependent, or sensitive to the uncertainty
in the EPS perturbations.

An important consideration for the EoE was the number of constituents required
to provide a thorough statistical sampling of the ambiguity distribution. This question is
analogous to the problem of determining an appropriate number of members to use for an
ensemble forecast. Too few constituents may lead to misrepresentation of the desired
ambiguity distribution even when the distribution the constituents are sampling is perfect
(Figure 1). For ensemble forecasts, error in forecast probability decays exponentially
with increasing ensemble size, with the most dramatic decrease for sizes ranging from 2 to
20 members, whereas the rate of decrease drops off significantly for sizes > 20 members.
This suggests that an ensemble size > 20 is needed to reasonably represent
the underlying true forecast PDF. Using this reasoning, it was assumed that an EoE with
> 20 constituents would adequately represent the ambiguity distribution. As
computational costs were not a significant limiting factor during the EPS runs, the EoE
was configured to produce N = 100 constituent ensemble forecasts in order to minimize
sampling error.

The setup of the L96M EPS used for the EoE forecast runs is shown in Figure 24.
In contrast to the setup used for determining the EPS error characteristics, the initial state
fed into the data assimilation for each of the 100 constituent runs is identical (outside the
dashed blue box in Figure 24). In this way, the DA process creates a unique set of
perturbed initial conditions for each constituent based on the same forecast situation. The
differences in the perturbed states are due to random processes within the DA process
varying the outcome within the realm of possible analysis error. Additionally, the model
parameterization values vary randomly (by design in the L96M) throughout the
constituent runs. The combination of varying initial condition and model parameter
perturbations results in varied but equally likely realizations of the uncertainty in the
future state of the system.

Creation of an EoE forecast case dataset begins similarly to the runs used for
finding the EPS error characteristics, described in Chapter III.A.3. From an initial set of
500 perturbed observations, we completed a spin-up cycle of 1000 model time steps,
updating the filter with new perturbed observations every 10 steps. Following the spin-up
cycle, the process then runs an additional 20 data assimilation cycles (10 model steps
each) from the final spin-up cycle analyses. The first EoE constituent forecast is run
using the analyses found at the completion of the last data assimilation cycle, where n
ensemble members are selected as before. For the next EoE constituent, another 20 data
assimilation cycles are run, once again starting from the final set of spin-up cycle
analyses. Thus the next constituent forecast is run over the same forecast period as the
previous constituent. This process is repeated to produce a single dataset of N EoE
constituents for a specific forecast scenario. Subsequent EoE forecast case datasets are
separated from the initial state of the previous forecast case by the standard separation
period (i.e., 200 model steps) to find cases from different regions in the L96M attractor.

2. Calibrated Error Sampling 

The calibrated error sampling (CES) method uses information on past
performance of the ensemble to estimate ambiguity. Errors in ensemble forecast
probability (p_e) may be the result of errors in any moment of the ensemble forecast PDF.
For CES, we focus on errors in the first two moments, as they are believed to be the
largest contributors. Possible errors in the ensemble PDF may be described by error
distributions for the ensemble mean and spread based on long-term verification. Such
error distributions reflect error due to finite sampling as well as error due to
unaccounted-for uncertainty in the EPS (Eckel and Allen 2009). For error in the first two
moments, we find the error distributions for the mean error of the ensemble mean [ME_ē,
Equation (16)] and fractional error in ensemble spread [σ', Equation (18)]. The mean values
for the ME_ē and σ' error distributions are computed over a full verification dataset, while
the spreads of the distributions are calculated using subsets of the full dataset, as described
in Chapter III.B.3. Note that the following explanation of CES follows that given in
Eckel and Allen (2009) using real-world JM 2-m temperature 5-day forecast data.

We may estimate the distribution of possible p_e errors by converting potential
PDF errors into p_e errors. Consider the following example of translating from a PDF
error to a p_e error using an arbitrary ensemble 2-m temperature forecast, defined as a
Gaussian with a mean of 2.8°C and a spread of 1.8°C, or N(2.8°C, 1.8°C) (Figure 25).
For this example, we assume that the true forecast PDF is known (which is generally
not the case), and it is N(2.2°C, 2.6°C). The errors in the ensemble PDF mean and spread
due to finite sampling and/or ensemble deficiencies are 0.6°C and -0.8°C, respectively.
The fractional error in spread is thus 1.8 / 2.6 = 0.69. The p_e error can be calculated for
any chosen event threshold by comparing the ensemble forecast probability and the true
forecast probability (p_T). For the event of temperature < 0°C, the p_e error is -13.9% since
p_e = 6.0% and p_T = 19.9% (Figure 25).
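
The arithmetic of this example can be checked with a short Python sketch using only the standard library. The function name is hypothetical; the two Gaussians and the 0°C threshold are the ones quoted above.

```python
from statistics import NormalDist


def p_e_error(ens_mean, ens_spread, true_mean, true_spread, threshold):
    """Return (p_e, p_T, p_e - p_T) for the event 'value below threshold'."""
    p_e = NormalDist(ens_mean, ens_spread).cdf(threshold)    # ensemble forecast probability
    p_t = NormalDist(true_mean, true_spread).cdf(threshold)  # true forecast probability
    return p_e, p_t, p_e - p_t


p_e, p_t, err = p_e_error(2.8, 1.8, 2.2, 2.6, threshold=0.0)
print(f"p_e = {p_e:.1%}, p_T = {p_t:.1%}, error = {err:.1%}")
```

Rounded to one decimal place, the result agrees with the values of 6.0%, 19.9%, and -13.9% quoted in the example.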

Performing the same type of calculation over many different event thresholds
(i.e., different values of 2-m temperature) for the same ensemble and true distributions
produces different p_e error values for each threshold chosen [Figure 26(a)]. Similarly,
we may employ a single event threshold while allowing the location of the ensemble PDF
to vary (i.e., an ensemble PDF with the same mean and spread errors placed in different
locations with respect to the event threshold). In our example, the positive bias of the
ensemble PDF results in primarily negative p_e error values (since we are considering the
probability of falling below the event threshold). Positive p_e errors are present for high
event thresholds once the true PDF's density becomes larger to the right due to the
under-spread ensemble PDF. When the event threshold moves deeper into the PDF tails on
either side, p_e error asymptotes to zero as the outcome of the event for both the ensemble
and true forecast PDFs becomes more certain (i.e., p_e closer to 0% or 100%). Our goal is
to provide an estimate of ambiguity as a function of ensemble forecast probability, so we
replot the results in Figure 26(b) as true probability versus ensemble forecast probability.

The ensemble PDF in Figure 25 exhibits merely one of many possible errors in
ensemble mean and spread. Different PDF error values can produce different p_e errors.
For a given EPS, the spectrum of ensemble PDF errors is described by error distributions
(one each for ensemble mean and spread) created by evaluating the EPS's long-term error
characteristics as described above. To remove systematic error, leaving only the random
error component, we bulk calibrate the forecasts as described in Chapter III.B.1.
Example error distributions for ME_ē and σ' are shown in Figure 27(a) and (b), where
the ME_ē distribution is fit using a normal distribution, and the σ' distribution is fit to a
gamma PDF.

CES also requires a distribution for the average ensemble spread, shown in the
example Figure 27(c). Error in the ensemble mean affects p_e error, but the actual value
of the ensemble mean does not impact p_e error. On the other hand, p_e error is affected
by both error in the ensemble spread and the magnitude of the ensemble spread itself.
Wider ensemble PDFs produce smaller values of p_e error since differences in the
ensemble and true probability densities become smaller. The distribution of average
ensemble spread is computed following the same methodology used to find the error
distributions. We then fit the ensemble spread distribution with a gamma distribution.

Scatter plots between these various parameters in Figure 28 show no strong
correlations between the three variables (average ensemble spread, error in ensemble
mean, and error in ensemble spread). The spread-skill relationship suggests that the
variability of errors in the forecast PDF increases with increasing ensemble spread, which
would result in larger ambiguity. Thus we must determine if a significant correlation
exists between ensemble spread and the variability (i.e., standard deviation) of errors in
the ensemble mean and spread. Figure 29 shows a plot of the errors in mean and spread and
indicates that the variability of both errors remains fairly constant regardless of ensemble
spread (and is thus independent of it). Additionally, the standard deviations of both errors
generally match the standard deviations of the error distributions in Figure 27(a) and (b).
The spread-skill relationship is likely not seen because the errors are domain averaged
for each forecast case. Since the variables are independent, we can sample randomly
from each variable's distribution to give a set of possible values, which may then be used
in CES.

To summarize the CES method, for a given set of random samples from
distributions as in Figure 27, the p_e error value for a specific calibrated forecast
probability value is found as follows. The randomly drawn ensemble mean error and
ensemble spread values are assumed to describe a Gaussian distribution, where the mean
error value is an error in location away from zero. Using this distribution, the event
threshold value giving the forecast probability in question is located. This event
threshold is then used to find the true forecast probability, where the true PDF is a
Gaussian distribution with zero location error and spread equal to the ensemble spread
divided by the randomly drawn fractional error. In this way, the spread of the true
forecast PDF will be greater than that of the ensemble PDF if the fractional error is less
than one. The p_e error is then the calibrated forecast probability in question minus the true
forecast probability. The true forecast probability is actually a single estimate from the
distribution of estimated true probabilities (p_T), since it was found using a single set of
error distribution and spread values representing plausible variations in the ensemble
forecast PDF based on past performance. It is important to note that the CES ambiguity
estimate is not based on knowing the true forecast PDF, but rather on knowing possible
values of its mean and spread relative to the ensemble PDF, as described by the EPS error
characteristics, as well as possible values of ensemble spread.
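
The single-draw procedure summarized above can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in this research: the function name is hypothetical, and the event is assumed to be "value below the threshold" (the opposite choice is symmetric).

```python
from statistics import NormalDist


def ces_true_probability(p_e_star, mean_error, spread, fractional_error):
    """One CES estimate of the true probability p_T for a calibrated p_e*.

    mean_error, spread and fractional_error are one set of random draws from
    the error and ensemble spread distributions.
    """
    # Ensemble PDF: Gaussian displaced from zero by the drawn mean error
    ensemble_pdf = NormalDist(mu=mean_error, sigma=spread)
    # Event threshold that yields the forecast probability in question
    threshold = ensemble_pdf.inv_cdf(p_e_star)
    # True PDF: zero location error, spread = ensemble spread / fractional error
    true_pdf = NormalDist(mu=0.0, sigma=spread / fractional_error)
    return true_pdf.cdf(threshold)


# Draws matching the earlier temperature example: 0.6 mean error, 1.8 spread
p_t = ces_true_probability(0.0599, 0.6, 1.8, 1.8 / 2.6)
```

Using the draws from the earlier temperature example returns a p_T of roughly 19.9%, so the p_e error (p_e* minus p_T) is again about -13.9%.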

In Figure 30, we see that possible p_T values vary dramatically for five sets of
random draws from the Figure 27 PDFs. For p_e = 55%, the p_T values range from 47%
to 80%, yielding a rough estimate of ambiguity (i.e., the actual p_T may randomly occur
within that range). Looking at the same p_e value (55%), a robust ambiguity estimate can
be created by repeating the CES process using 50,000 random samples from the error and
ensemble spread distributions, thus producing a specific ambiguity distribution [Figure
31(a)]. CES produces an ambiguity distribution for all values of calibrated forecast
probability. We define total ambiguity as the width of the 90% CI (i.e., the maximum likely
value minus the minimum likely value) of the distribution of p_T values for a specific
calibrated forecast probability value,

total ambiguity = p_95 - p_5 (26)

where p_5 and p_95 represent the 5th and 95th percentile of the rank-ordered p_T values,
respectively.
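
As a minimal sketch, Equation (26) can be evaluated from a list of p_T samples with the Python standard library. The percentile interpolation rule is an assumption (the "inclusive" method of `statistics.quantiles`); the text does not state which rule was used.

```python
from statistics import quantiles


def total_ambiguity(p_t_values):
    """95th minus 5th percentile of the p_T samples, as in Equation (26)."""
    # n=20 returns the 19 cut points at 5%, 10%, ..., 95%
    cuts = quantiles(p_t_values, n=20, method="inclusive")
    return cuts[-1] - cuts[0]


# Evenly spaced samples on [0, 1] give a total ambiguity of about 0.90
evenly_spaced = [i / 100 for i in range(101)]
width = total_ambiguity(evenly_spaced)
```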

Computing the total ambiguity for each calibrated forecast probability value
yields the results in Figure 32, conveying the general, overall ambiguity of our example
EPS temperature forecasts. In general, these results are not meaningful when considering
an ensemble forecast at a specific point, as the results were produced for all possible
values of ensemble spread. A specific forecast has a specific ensemble spread. Figure 33
shows that CES runs for fixed values of ensemble spread, but with the same variability of
mean and spread error described by the distributions in Figure 27(a) and (b), have very
different amounts of ambiguity. Thus, in real-world applications, specific CES ambiguity
distributions must be generated for the full range of observed ensemble spread values.

Therefore, CES takes two forms in this research, distinguished by whether the CES
ambiguity distributions are developed using randomly varying ensemble
spread values or using specific ensemble spread values. The former, termed CES Global
(CES_G), produces a bulk estimate of the ambiguity distributions for any calibrated
forecast probability, independent of ensemble spread. The latter method, termed CES
Local (CES_L), provides a flow-dependent estimate of the ambiguity distributions specific
to ensemble spread values.

CES requires a significant up-front computational expense to produce the
ambiguity lookup tables for each calibrated forecast probability value. Real-time
application involves simply calculating the ensemble spread and then accessing the lookup
table to get the ambiguity data. The crux lies in developing error distributions for the
ensemble mean and spread, as these distributions are likely sensitive to changes in
forecast lead time, season, location, weather patterns, etc. Subsetting of the verification
datasets must be accomplished in such a way as to avoid combining dissimilar signals
while maintaining a large enough sample size to get accurate results.
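
A spread-dependent lookup table of this kind might be built as in the following Python sketch. Everything specific in it is an assumption made for illustration: the error-distribution parameters, the table layout keyed by (spread, p_e*), the sample count, and the function names. The mean error is drawn from a zero-centred normal distribution and the fractional error from a gamma distribution with mean one, mirroring calibrated error distributions.

```python
import random
from statistics import NormalDist, quantiles


def ces_total_ambiguity(p_e_star, spread, me_sd, fe_shape, fe_scale,
                        n_samples=2000, seed=0):
    """Total ambiguity (95th minus 5th percentile of p_T) for one table cell."""
    rng = random.Random(seed)
    p_t_values = []
    for _ in range(n_samples):
        mean_error = rng.gauss(0.0, me_sd)                 # random error in the mean
        frac_error = rng.gammavariate(fe_shape, fe_scale)  # random fractional spread error
        threshold = NormalDist(mean_error, spread).inv_cdf(p_e_star)
        p_t_values.append(NormalDist(0.0, spread / frac_error).cdf(threshold))
    cuts = quantiles(p_t_values, n=20, method="inclusive")
    return cuts[-1] - cuts[0]


def build_lookup(spreads, probabilities, **error_params):
    """Hypothetical table layout: (spread, p_e*) -> total ambiguity."""
    return {(s, p): ces_total_ambiguity(p, s, **error_params)
            for s in spreads for p in probabilities}


# Illustrative parameters only: gamma(25, 0.04) has mean 1.0 and std. dev. 0.2
table = build_lookup([0.5, 3.0], [0.1, 0.5, 0.9],
                     me_sd=0.5, fe_shape=25.0, fe_scale=0.04)
```

At run time, only the dictionary lookup for the observed ensemble spread (or its nearest tabulated value) would be needed.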

3. Randomly Calibrated Resampling 

The second practical method for estimating ambiguity, randomly calibrated
resampling (RCR), employs bootstrap resampling, which is designed to estimate the
uncertainty in sample statistics (Wilks 2006). In application here, the sample dataset is
the set of ensemble members and the sample statistic is the forecast probability for a
given event. A single resampling of the n-member ensemble values consists of making n
random draws with replacement, resulting in a new version of the dataset and a different
p_T for the event. Repeating this process 10,000 times gives a distribution of p_T values.
It is important to note that the original p_e from the control ensemble forecast will be near
the mean of the resampled p_T distribution since averaging the alternative datasets
reproduces the original (Eckel and Allen 2009). Note that the following explanation of
RCR follows that given in Eckel and Allen (2009) using real-world JM 2-m temperature
5-day forecast data.
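
The resampling step alone can be sketched in Python as follows. For brevity, the forecast probability is taken as the simple fraction of members exceeding the threshold, which is an assumption and not the rank-based calculation of Equation (22); the function name is hypothetical.

```python
import random


def bootstrap_probabilities(members, threshold, n_resamples=10_000, seed=0):
    """Distribution of forecast probabilities from bootstrap resampling."""
    rng = random.Random(seed)
    n = len(members)
    probabilities = []
    for _ in range(n_resamples):
        resample = rng.choices(members, k=n)  # n random draws with replacement
        # Simplified probability: fraction of resampled members above the threshold
        probabilities.append(sum(m > threshold for m in resample) / n)
    return probabilities


members = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
p_values = bootstrap_probabilities(members, threshold=1.5)
```

For this eight-member example the original probability is 4/8 = 50%, and the mean of the resampled values falls close to it, consistent with the statement above.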

Resampling alone will not provide an accurate estimate of the ambiguity
associated with a given ensemble forecast, since the resampling process accounts for only
one source of ambiguity, finite sampling. The resampled ambiguity distribution is
dependent on the size of the ensemble used to represent the true forecast PDF.
Resampled ensemble forecasts from a small ensemble are likely to produce very different
PDFs and subsequently very different p_T values [Figure 34(a)], resulting in a wider
ambiguity distribution. The resampled datasets from a well-sampled, large ensemble are
more likely to give similar PDFs, reducing the range of p_T values [Figure 34(b)].
Resampling does not account for random error due to deficient simulation of sources of
uncertainty. Possible forecast solutions missing among the members due to deficiencies
of an imperfect ensemble would never show up in any resampling; thus the p_T PDF will
be too narrow (Eckel and Allen 2009).

Since forecast probability values are confined between 0% and 100%, systematic
bias can also affect the width of the p_T PDF. For example, an ensemble with a negative
bias will erroneously shift the p_T PDF towards 0%, causing a decrease in variance as p_T
values are unable to cross the lower bound. To provide an accurate estimate of
ambiguity, the effects of random and systematic error must be included.

Each resampled dataset can be calibrated using information from the EPS's error
characteristics by applying the 'shift-and-stretch' technique described previously
(Chapter III.B.1). As before, the bulk mean error in the ensemble mean [ME_ē,
Equation (16)] is used to correct the first moment of the ensemble PDF. Corrections to
ensemble spread are made using the average fractional error in ensemble spread [σ',
Equation (18)] to stretch (or compress) the bias-corrected members about their mean.
Each resampled dataset is calibrated individually, giving calibrated forecast probability
values for each resample.
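
A minimal Python sketch of a shift-and-stretch step is given below. It assumes the mean error is defined as ensemble mean minus truth (so it is subtracted) and the fractional error as ensemble spread divided by true spread (so deviations are divided by it, consistent with the 1.8 / 2.6 example earlier); the exact forms of Equations (16) and (18) are not reproduced here, and the function name is hypothetical.

```python
import statistics


def shift_and_stretch(members, mean_error, fractional_error):
    """Calibrate ensemble members: remove the bias, then rescale the spread."""
    shifted = [m - mean_error for m in members]  # shift: bias-correct the mean
    center = statistics.fmean(shifted)
    # stretch (fractional_error < 1) or compress (> 1) about the corrected mean
    return [center + (m - center) / fractional_error for m in shifted]


calibrated = shift_and_stretch([1.0, 2.0, 3.0], mean_error=0.5, fractional_error=0.5)
```

In this small example the mean moves from 2.0 to 1.5 and the standard deviation doubles, since the assumed fractional error of 0.5 indicates an ensemble with half the required spread.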

Bootstrapping and calibration account for systematic errors and random errors due to
finite sampling, but not random errors due to unrepresented sources of uncertainty. To
include these effects, the calibration applied to each resampled dataset is varied by using
random draws from the EPS's ME_ē and σ' error distributions. Random calibration takes
into account the variation in the ensemble forecast error statistics, which results from the
EPS's inability to simulate all of the uncertainty associated with the forecasts. Thus,
calibrating based on the random errors brings in possible forecast solutions that would
otherwise be absent in the resampled datasets. As the distributions from which the
random deviations are drawn are centered at the average ME_ē and σ', the original
calibrated forecast probability value is maintained as the central value of the p_T PDF
(Eckel and Allen 2009).

Before the random deviations are drawn from the distributions of ME_ē and σ',
we must remove the error variance due to finite sampling (Eckel and Allen 2009).
Otherwise, finite sampling will be accounted for twice (i.e., once by bootstrap resampling
and once by random calibration), leading to overestimation of ambiguity. The sampling
distributions in Figure 2 reflect the contributions purely from finite sampling to error in
the ensemble mean and spread associated with calibrated PDFs for various ensemble
sizes. Since we are concerned with adjusting the raw error distributions, the spread of the
distributions in Figure 2 must be de-standardized.

When calibrating ensemble spread towards an average fractional error of one, the
spread of the fractional error distribution is adjusted by nearly the same proportion as the
average fractional error. Thus, to de-standardize the spread of the fractional error
distribution for an n-member ensemble from Figure 2, we reduce the spread value by a
factor equal to the raw average fractional error (from EPS forecast verification) divided
by the average fractional error for an n-member ensemble (from Figure 2). The spread of
the EPS's fractional error distribution is then reduced by the de-standardized spread due
to finite sampling to give the reduced error distribution for random calibration.

For the contribution of finite sampling for an n-member ensemble to variance in
ME_ē, Figure 2(a) gives a standardized (i.e., calibrated) value based on 1/√n, which
must be inflated by RMSE_ē to de-standardize to the ME_ē distribution. The RMSE_ē
represents the best estimate of the standard deviation of the true forecast PDF (σ_T), since
both σ_T and RMSE_ē represent the average error in observations away from the true
mean (μ_T) or the bias-corrected ensemble mean (Eckel and Allen 2009). Therefore,
RMSE_ē/√n is subtracted from the standard deviation of the EPS's ME_ē error
distribution to arrive at the PDF for random calibration. Examples of the reduced error
distributions of ME_ē and σ' are shown in Figure 35.
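
The reduction of the mean-error standard deviation described above amounts to a one-line calculation, sketched here in Python. The function name is hypothetical, and the clamp at zero is an added safeguard that is not taken from the text.

```python
import math


def reduced_me_std(me_std, rmse_ens_mean, n_members):
    """Standard deviation of the mean-error distribution after removing the
    finite-sampling contribution, RMSE / sqrt(n)."""
    sampling_std = rmse_ens_mean / math.sqrt(n_members)
    return max(me_std - sampling_std, 0.0)  # clamp is an assumption, not from the text


# Example: std. dev. 1.0, RMSE 2.0, 16 members -> 1.0 - 2.0 / 4 = 0.5
reduced = reduced_me_std(1.0, 2.0, 16)
```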

Each resample is thus randomly calibrated using information on the long-term
variability of the ensemble's error, which generates a wider p_T PDF (Figure 36). The
width of the RCR ambiguity distribution is strongly dependent upon the spread of the
original ensemble forecast, giving the RCR estimate flow-dependent characteristics. An
ensemble forecast with less uncertainty (low spread) will typically have a wider p_T PDF
compared to a forecast with greater uncertainty. For a low spread forecast, the
adjustment in location for each resampled forecast PDF due to the random calibration
results in a larger range of p_T values for a given event threshold (discussed in detail in
the Results chapter).

Although RCR appealingly produces a more flow-dependent ambiguity estimate,
it comes with a significant real-time computational cost. Generating and analyzing the
resampled datasets (10,000 at each grid point for every variable of interest) may be too
computationally demanding for operational application.
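
The full RCR loop (bootstrap resampling followed by a randomly drawn shift-and-stretch calibration) might be sketched in Python as below. This is an illustration under stated assumptions, not the procedure of Eckel and Allen (2009) in detail: the mean error is drawn from a normal distribution and the fractional error from a gamma distribution, both assumed to be already reduced for finite sampling; the forecast probability is the simple fraction of members above the threshold; and all names and parameter values are hypothetical.

```python
import random
import statistics


def rcr_probabilities(members, threshold, bulk_me, me_sd, fe_shape, fe_scale,
                      n_resamples=5000, seed=0):
    """Distribution of p_T estimates from randomly calibrated resampling."""
    rng = random.Random(seed)
    n = len(members)
    probabilities = []
    for _ in range(n_resamples):
        resample = rng.choices(members, k=n)               # bootstrap step
        mean_error = rng.gauss(bulk_me, me_sd)             # random mean-error draw
        frac_error = rng.gammavariate(fe_shape, fe_scale)  # random spread-error draw
        shifted = [m - mean_error for m in resample]       # shift
        center = statistics.fmean(shifted)
        calibrated = [center + (m - center) / frac_error for m in shifted]  # stretch
        probabilities.append(sum(c > threshold for c in calibrated) / n)
    return probabilities


members = [-2.0 + 0.2 * i for i in range(21)]  # hypothetical 21-member forecast
narrow = rcr_probabilities(members, 0.1, 0.0, 0.0, 400.0, 0.0025)  # almost no random calibration
wide = rcr_probabilities(members, 0.1, 0.0, 1.0, 400.0, 0.0025)    # large random mean error
```

Comparing the variance of `narrow` and `wide` shows how random calibration widens the resampled distribution beyond what bootstrap resampling alone provides.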

D. VALIDATION 

There is a fundamental difference between the EoE approach and the other two
ambiguity estimation techniques. In EoE, the original, calibrated, control forecast
probability value (p_e*) represents a single, random draw from the theoretical p_T PDF,
which is estimated by the EoE p_T PDF. Thus, the original p_e* can fall anywhere within
the p_T distribution. The other two estimation techniques use information on past
ensemble performance to provide a p_T distribution (i.e., p_T PDF estimate) that is
centered on the original p_e*. Because of this difference, our validation efforts were
confined to determining how well the practical estimation techniques captured the
variance of the EoE ambiguity distribution.

The EoE produces a spectrum of possible forecast PDFs and a p_T PDF for any
particular event at some particular lead time, and it dynamically captures the EPS
limitations (i.e., limited sampling and inadequate simulation of uncertainty). EoE reflects
the flow-dependent deficiencies in the perturbations associated with the different regions
in the attractor. CES_G, on the other hand, produces a p_T PDF for any particular p_e*,
which could come from any event. The p_T distribution is a generic ambiguity
distribution based solely on the EPS's average error characteristics, which are taken as
the same over the entire attractor. CES_L produces a somewhat flow-dependent
ambiguity distribution based on the same general error distributions but dependent on
the specific ensemble spread. RCR again uses the same general error distributions, and its
ambiguity estimate is somewhat flow-dependent since the estimate is sensitive to the
distribution of members in the ensemble PDF.

The validation strategy considers aggregated ambiguity distributions built over
many locations on the L96M attractor in order to determine the overall effectiveness of
the ambiguity estimation methods. This strategy was necessary because the sample PDFs
used to find the p_T values for the CES_G ambiguity distribution were created using the
long-term, average error distributions. Thus the CES_G ambiguity distribution reflected
the forecast uncertainty from a combination of many possible events or locations on the
L96M attractor. We created the aggregates by combining data from all of the EoE
forecast cases used for validation into a single dataset. Accordingly, the same
aggregation had to be done for the CES_G and RCR datasets. CES_L was developed based
on the evolution and validation results in this research (Chapter IV), which unfortunately
resulted in its omission from the validation study due to time and processing constraints.

For this validation strategy, we must consider what each ambiguity estimation
method regards as the expected value of its ambiguity distribution, E(p_T). The
E(p_T) for CES_G and RCR is the calibrated forecast probability value (p_e*), which is
the best-guess forecast probability value from the control ensemble forecast. The EoE
E(p_T) is the expected value of its p_T PDF, which may be very different from p_e*.
Thus, to validate CES_G, a certain p_e* is chosen and then many cases where the EoE E(p_T)
matches p_e* are found. The aggregate of EoE p_T distributions from the many cases
should then match the generic CES_G p_T PDF.

The validation approach is similar for RCR, with one notable difference. In the
case of CES_G, where the ambiguity distributions are static, the EoE data does not have to
coincide with any particular forecast scenario. RCR, on the other hand, requires that the EoE
results specifically coincide with those of the resampled ensemble due to the
flow-dependent aspect. From an EoE forecast case of N constituents, a single constituent is
drawn to act as the control ensemble forecast. The control is then used to create the RCR
ambiguity estimate, which is centered on the calibrated control forecast probability (p_e*).
The complete set of N constituents is then used to create the EoE ambiguity distribution.
Validation is performed where the E(p_T) values for both the EoE and RCR ambiguity
distributions are equal. As RCR uses random deviates of the long-term average error of
the EPS, its p_T PDF may be over- or under-spread compared to EoE for any one case.
Therefore, it is again necessary to aggregate many forecast scenarios from across the
attractor. Thus, for validation, the RCR p_T distributions and the associated EoE p_T
distributions are aggregated separately for comparison.

These comparisons show how well the estimated CES_G and RCR ambiguity
distributions capture the variance of the EoE ambiguity distribution when the EoE
E(p_T) equals p_e*. However, we cannot validate the estimation methods' ability to
consistently capture the location of the EoE ambiguity distribution. Both the CES and
RCR ambiguity distributions will be centered on the calibrated forecast probability from
the control ensemble forecast, which is a random sample from the EoE ambiguity
distribution; thus a random error in location exists.

1. Processing of Ambiguity Data 

We used the EoE to create 100 sets of 100 constituent ensemble forecasts to be
used during validation and value testing. All of the sets were used during the evaluation
of value, but computational constraints associated with RCR allowed only 20 of these
sets to be used during the validation of the estimation techniques. For each of the 20 EoE
forecast cases used for validation, the data were processed to ensure comparisons were
performed using ambiguity distributions with equal expected values. The overall
processing scheme for the ambiguity data used for validation is shown in Figure 37.

To find the EoE distributions for a single set of 100 constituent forecasts, we
tested each of the eight variables sequentially across the range of forecast probability
values shown in Figure 37. The forecast probability values tested represent possible p*
values found using the control ensemble forecast. We will describe the postprocessing
procedure for p* = 50% and for a single EoE forecast case of 100 constituents here as an
example of how all p* values and variables are processed at each lead time.

We must determine the event threshold Z-value that yields E(Pj) across the 100
constituents equal to the forecast probability value being tested. Thus, for a given
variable within the set of EoE constituents, there exists a Z-value such that the
distribution of Pj values calculated using that Z-value as the event threshold with each
constituent forecast creates an E(Pj) equal to 50% within 0.01%. Once this Z-value is
located, we know the distribution of constituent Pj values for that variable and
p* = 50% in our EoE forecast case. The Z-value and Pj distribution will be different for
different EoE forecast cases or even for different variables within a single dataset.

We employed an iterative-bisection method for determining the Z-value (Figure 38). Here,
the control ensemble forecast, taken as the first constituent of the EoE forecast case,
was used to find the range of values of the variable based on the ensemble members. This
range was then expanded on either side by an arbitrary amount, Figure 38(a). We expanded
the range after initial tests had difficulty converging on a Z-value for extreme forecast
probability values. The average of the largest and smallest values in the range was then
taken as the first test value used to calculate Pj across the constituents, Figure 38(b).
We then tested the E(Pj) from the 100 constituents against the desired p* = 50%, which
resulted in some error value. When the magnitude of the error was too large compared to
the tolerance (set to 0.01%), we used the signed value of the error to determine which
direction to move when adjusting the range of Z-values used for determining the next test
Z-value, Figure 38(c). The process repeats until the algorithm converges. The final
output available for use during validation is a distribution of 100 possible Pj values for
p* = 50% and for this variable in this EoE forecast case. Again, this process was repeated
for all p* values listed in Figure 37, for each variable, at all required lead times, for
every EoE forecast case.
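
The bisection search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this research: the function names, the padding amount, and the use of a normal fit to each constituent's members to obtain the forecast probability are all hypothetical choices made here so that E(Pj) varies smoothly with the threshold. The event is assumed to be the variable falling at or below the threshold.

```python
from statistics import NormalDist, mean, stdev


def event_probability(members, z):
    """Probability that the variable is <= z for one ensemble.

    Assumption: a Gaussian fit to the members stands in for the
    probability calculation actually used in the study.
    """
    return NormalDist(mean(members), stdev(members)).cdf(z)


def expected_probability(constituents, z):
    """E(Pj): average forecast probability across all constituents."""
    return mean(event_probability(members, z) for members in constituents)


def find_threshold(constituents, target, tol=1e-4, pad=5.0, max_iter=200):
    """Bisect for the threshold z that gives E(Pj) within tol of target.

    tol=1e-4 is 0.01% in probability units; pad is a hypothetical
    amount by which the control ensemble's range is widened.
    """
    control = constituents[0]            # first constituent acts as control
    lo, hi = min(control) - pad, max(control) + pad
    for _ in range(max_iter):
        z = 0.5 * (lo + hi)              # midpoint is the next test value
        error = expected_probability(constituents, z) - target
        if abs(error) <= tol:
            return z
        if error > 0:                    # probability too high: move down
            hi = z
        else:                            # probability too low: move up
            lo = z
    raise RuntimeError("bisection did not converge")
```

The sign of the error selects which half of the range is kept, which mirrors the step shown in Figure 38(c). Widening the initial range matters most for extreme target probabilities, where the root can lie outside the span of the control members.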

Application of CESq during the validation process required a control forecast
probability value about which the appropriate preprocessed CESq ambiguity distribution
was placed. We took the control ensemble forecast for CESq as the first constituent
from a set of EoE forecasts. The CESq static ambiguity distribution associated with each
possible p* value was found using the process described in Chapter III.C.2.

Similarly to CESq, the RCR estimation required defining a control ensemble forecast,
which again we took as the first constituent of each EoE set. The resampling process was
then performed using the n ensemble members. For RCR, we used the uncalibrated control
ensemble forecast data, since each resample must be calibrated using randomly drawn
calibration coefficients. Once each of the 10,000 resampled ensemble datasets had been
randomly calibrated, we employed the iterative-bisection method again to find the Z-value
giving E(Pj) equal to the desired p* within 0.01% error. After converging, the desired RCR
ambiguity distribution consisting of 10,000 Pj values was known. Since this process had to
be completed for each variable within each EoE set at every lead time for every desired p*
value, the processing time was extraordinary. Therefore, for this research, we confined
the RCR calculations, and thus validation, to forecast lead times out to 5 time units at a
time increment of 0.2 time units, and processing was only accomplished for a limited
number of p* values (see Figure 37). Additionally, we did not perform validation of the
later forecast lead times, as changes to the ambiguity distributions from all three
estimation techniques were insignificant beyond 5 time units.

2. Comparing Ambiguity Estimates

The forecast times and probability values available from the computationally
expensive RCR data constrain the comparison of the ambiguity estimates from the three
methods. Thus, we made comparisons only through forecast lead times of 5 time units at
a time increment of 0.2 and for the p* values listed in Figure 37. Comparisons were made
using total ambiguity [Equation (26), page 61]. So, for each p* value at each time, we
found the upper and lower bounds of the 90% CI for each estimate type. As described in
the previous section, the expected values of the estimated ambiguity distributions match
by design, so comparing the 90% CI ranges provides a measure of the similarity in the
variance of the ambiguity distributions. Even if the total ambiguity is equal, we cannot
conclude that the ambiguity distributions are the same, since one of the distributions may
exhibit differences in higher moments. Thus, we are limited to validating only the
variance of the ambiguity distributions.

In accordance with the validation theory, we validated specific p* values using
aggregates of the EoE and RCR ambiguity distributions. Since 20 EoE sets are used, this
resulted in an EoE distribution of 16,000 Pj values and an RCR distribution of
1,600,000 Pj values. The CES Pj distribution contained 50,000 values. The lower and
upper CI bounds for a given distribution were found by sorting the Pj values into
ascending order and taking the 5th and 95th percentiles, respectively, based on the size
of the dataset. We then computed the total ambiguity for the distributions as the upper
bound minus the lower bound. We compared the CESq and RCR ambiguity estimates to
the EoE “standard” by subtracting the EoE estimate from the CESq or RCR estimate.
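
As an illustration of the comparison just described, the sketch below computes total ambiguity as the width of the 90% CI of a set of Pj values and then takes the difference from the EoE value. The function names are hypothetical, and the nearest-rank rule used to pick the 5th and 95th percentiles from the sorted values is an assumption, since the exact percentile convention is not stated here.

```python
def percentile_value(sorted_values, pct):
    """Nearest-rank percentile (pct given as an integer percent)."""
    n = len(sorted_values)
    rank = max(1, (pct * n + 99) // 100)   # ceil(pct * n / 100), at least 1
    return sorted_values[rank - 1]


def total_ambiguity(pj_values, lower_pct=5, upper_pct=95):
    """Width of the 90% CI of an ambiguity distribution."""
    ordered = sorted(pj_values)
    upper = percentile_value(ordered, upper_pct)
    lower = percentile_value(ordered, lower_pct)
    return upper - lower


def ambiguity_difference(estimate_values, eoe_values):
    """Estimate minus the EoE standard.

    A negative result means the estimated distribution is narrower
    than the EoE distribution.
    """
    return total_ambiguity(estimate_values) - total_ambiguity(eoe_values)
```

Because the expected values of the distributions are matched beforehand, a difference near zero indicates similar spread but says nothing about higher moments.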

E. VALUE USING UNCERTAINTY-FOLDING 

We applied the uncertainty-folding approach to ambiguity distributions developed
using the EoE, CESq and RCR estimation techniques. Additional testing was performed
using what is termed the grand ensemble, which consisted of combining the ensemble
members from the 100 constituents for a given EoE forecast case into a large 2100-member
ensemble. The grand ensemble may provide evidence that EPS designers would
be better off allocating resources towards improving the EPS (i.e., running more
members) versus devoting resources to implementing the impractical EoE technique to
estimate ambiguity. We ran tests using all of the 100 EoE constituent forecast cases with
the two event thresholds previously described (Chapter III.A.4). Thus, for each event
threshold, we had 800 control (p*) and grand (Pg) ensemble forecast probabilities and
800 uncertainty-folding forecast probabilities (i.e., one for each of the eight variables in
each of the 100 constituents) for each ambiguity estimation method available at each
forecast lead time. As before, the control ensemble forecast was always taken as the first
constituent of each EoE forecast case. We confined the analysis of value to lead times up
to 5 time units because changes in the ambiguity distributions were insignificant beyond
this time and because the L96M EPS was shown to lose skill shortly after this time.

For a specific forecast case, we developed the EoE ambiguity distributions by
determining the forecast probability for the two selected thresholds for each of the 100
EoE constituents. The CESq and RCR ambiguity distributions were found using the
control ensemble forecast (members from constituent #1), following the same procedure
described in Chapter III.C. For the grand ensemble, we found a single Pg value using
uniform ranks with the 2100-member ensemble for each of the event thresholds. To find
the grand ensemble’s Pg, each of the constituent forecasts was calibrated separately
using the average error characteristics for the 21-member L96M EPS. It may have been
more appropriate to combine the constituent members and then calibrate using calibration
coefficients for a 2100-member L96M EPS, but the computational expense of finding the
error characteristics prevented this approach. We then applied uncertainty-folding
(Chapter II.C.1) with the EoE, CESq and RCR ambiguity distributions to find the
uncertainty-folding forecast probability associated with each estimate.

We analyzed value using an extension of the optimal VS, called the integrated
optimal VS (IOVS):

$$\mathrm{IOVS} = \sum_{i} \Delta x \cdot \begin{cases} 0, & VS_i \le 0 \\ VS_i, & VS_i > 0 \end{cases}, \qquad i = 0.005, \ldots, 0.995 \qquad (27)$$

where Δx = 0.01 and VS_i is the value score attained using C/L = i as the decision
threshold. The summation computes the positive area under the VS curve for a given
forecast technique at a specific lead time (Figure 39) by breaking the area into sections of
width Δx and length VS_i. In this approach, the optimal VS found using p* from the
control ensemble, Pg from the grand ensemble, or the uncertainty-folding forecast
probability from EoE, CESq or RCR at a specific lead time is integrated across all C/L,
giving a single IOVS value for each source at each forecast lead time.
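
The integration can be sketched in a few lines. This is an illustrative example only: the function names are hypothetical, and the value scores passed in are assumed to have been computed elsewhere, one for each C/L value from 0.005 to 0.995 in steps of 0.01.

```python
def cost_loss_grid(n=100):
    """Midpoints 0.005, 0.015, ..., 0.995 of n equal C/L intervals."""
    return [(k + 0.5) / n for k in range(n)]


def iovs(value_scores, dx=0.01):
    """Positive area under the VS curve: sum of dx * VS_i where VS_i > 0."""
    return sum(dx * vs for vs in value_scores if vs > 0)


def standardized_iovs(method_scores, control_scores, dx=0.01):
    """IOVS of a method divided by the IOVS of the control forecast.

    Values above one indicate more value than the control.
    """
    return iovs(method_scores, dx) / iovs(control_scores, dx)
```

Negative value scores are dropped rather than summed, since a user at that C/L would be better served by climatology and contributes no value.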

Using IOVS allowed uncertainty-folding from EoE and RCR and the forecast
probability from the grand ensemble to be easily compared for all lead times. For
comparison, we standardized the IOVS values associated with the ambiguity estimation
techniques and the grand ensemble with respect to the IOVS based on the control forecast
probability from the first constituent in each EoE forecast case. Thus, scores greater than
one indicate improved value over the control ensemble forecast, while scores below one
indicate a reduction in value.

F. VALUE USING SECONDARY DECISION CRITERIA 

We undertook the study of value associated with application of secondary
decision criteria using a real-world operational EPS. In the study, we developed a
process for applying ambiguity information towards improving the secondary criterion of
minimizing repeat false alarms at all locations (i.e., grid points). We used the CES
ambiguity distributions for this portion of the value study because it is the most practical
approach to use operationally over a large domain. Thus, we attempt to use ambiguity
and decision thresholds, both based on past performance, to add value in a real-world
decision context.

1. Description of Real-world EPS and Ground Truth Data

We obtained historical ensemble forecast data from the THORPEX Interactive 
Grand Global Ensemble (TIGGE) database. TIGGE is a collaborative project where 
ensemble data is made available for scholarly research in support of the THORPEX goals 
of improving accuracy of 1-day to 2-week forecasts (WMO 2009). The database holds 
surface and upper level variable data from ten operational centers from around the globe 
dating as far back as 2001. We retrieved the ensemble forecast and ground truth data 
through the ECMWF TIGGE internet portal (TIGGE 2009). The following EPS and 
model descriptions were also obtained from the TIGGE portal. 

For the secondary criteria value studies, we chose the Global Ensemble Forecast
System (GEFS) provided by NCEP. GEFS is a 21-member, single-model ensemble
based on the NCEP Global Forecast System (GFS). The model horizontal resolution
provided is T126 (or ~110 km) with 28 vertical levels. GEFS forecast data are provided
on a 1° x 1° grid initialized daily at 00Z, 06Z, 12Z and 18Z over the forecast period T+0
to T+384 hours at a six-hour increment. Initial condition perturbations are produced
using an ensemble transform method that incorporates regional rescaling of perturbations
with an optimization period of 48 hours. At the time of the study, GEFS contained no
model perturbations or surface boundary condition perturbations, thus ignoring a
significant source of uncertainty. Due to the limited number of members and the lack of
model perturbations, we expected the ambiguity associated with GEFS to be high, which
is why it was chosen for this portion of the research.

We focused the secondary criteria value studies on GEFS 120-hour (T+120)
forecasts of 2-m temperature over a CONUS domain with corner points 50N, 125E and
24N, 66E. Based on the 1° x 1° grid, this gave us 1,620 forecast-verification points per
forecast case, where independence among data points was assumed. Two separate
verification periods were used during this study. The first was a training period where we
verified the GEFS forecasts to determine the calibration coefficients as well as the
forecast error characteristics used for estimating ambiguity. The training period ran from
15 Dec 2007 to 15 Feb 2008 using only 12Z forecasts for a total of 63 forecast days
during the winter season. The period was chosen to avoid seasonal transitions where
error characteristics may change dramatically over short periods of time, while providing
a large enough dataset for robust estimates of the error characteristics. The second period
was an independent application dataset where we performed the value studies employing
the error characteristics obtained during the training period. We therefore had to assume
stationarity and seasonal dependence of the systematic and random error in order to apply
the error statistics to the application period. The application period covered 1 Jan 2009
through 31 Jan 2009 using only 12Z forecasts for a total of 31 forecast days.

To determine the error characteristics associated with the GEFS EPS, we chose
the ECMWF global model analysis (T+0 hours) as the ground truth to use for
verification. The ECMWF analysis, originally run at horizontal resolution T799 (or ~25
km) with 91 vertical levels, is archived on the TIGGE portal using an N200 reduced
Gaussian grid. We requested the data on a 1° x 1° grid through the TIGGE portal, where
the portal software automatically interpolates the data to the user’s requested format
using a bilinear interpolation (Fuentes 2008). We retrieved 12Z 2-m temperature
analyses for 20 Dec 2007 through 20 Feb 2008 for a total of 63 days over the training
period. Analyses for the application period consisted of 31 days from 6 Jan 2009 through
5 Feb 2009 for 12Z.

2. Metrics used in Secondary Criteria Value Study 

As discussed previously, we want to add value to the secondary criteria without
significantly changing the primary value. The primary value of the forecasts
was established using the optimal VS. The main secondary value metric was simply the
number of repeat false alarms. Other metrics (described below) were used to ensure the
primary value of the forecasts was not significantly degraded. We compute all metrics to
diagnose any change in primary and secondary value for decisions based on the control
ensemble forecast alone versus consideration of the ambiguity information.

To ensure the primary value was not significantly reduced, we used the
following additional metrics. Probability of detection (POD) is defined as the proportion
of correctly forecast occurrences (Jolliffe and Stephenson 2003). Based on the
nomenclature introduced in the contingency table (Table 2, page 31),

$$POD = \frac{a}{a + c} \qquad (28)$$

A metric not normally found in verification readings was defined for this research, the
probability of missed detection (POMD). This metric is defined as the proportion of
incorrectly forecast occurrences:

$$POMD = \frac{c}{a + c} \qquad (29)$$

The significance of changes to the metrics was evaluated using the 95% CI about
the scores. For determining optimization, we considered a change insignificant if the
expected value of the metric from the alternate behavior fell within the upper and lower
bounds of the control’s 95% CI.
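
A short sketch of these two metrics and the significance test follows. It assumes the usual contingency-table convention in which a is the number of hits and c is the number of misses; the function names and the way the control's CI is passed in are hypothetical.

```python
def pod(a, c):
    """Probability of detection: hits over all event occurrences."""
    return a / (a + c)


def pomd(a, c):
    """Probability of missed detection: misses over all occurrences."""
    return c / (a + c)


def change_is_insignificant(alternate_value, control_ci):
    """True when the alternate score lies inside the control's 95% CI.

    control_ci is a (lower, upper) pair, assumed here to have been
    obtained separately (for example by resampling).
    """
    lower, upper = control_ci
    return lower <= alternate_value <= upper
```

With these definitions POD and POMD always sum to one, so a rise in misses caused by reversed decisions shows up in both scores.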

3. Secondary Criteria Value Study Scenario 

As described above, we used the secondary criterion of reducing the number of
repeat false alarms for this research, where a repeat false alarm is defined as two
unnecessary protections in a row at a specific forecast location. Repeat false alarms were
chosen because of the tangible and intangible effects they may produce, such as loss of
customer confidence in the forecast and degraded mission effectiveness, among others.
While a user may employ the ambiguity information to help prevent repeated misses, we
considered only repeat false alarms as a secondary criterion. As this was a preliminary
study into assessing the value associated with secondary criteria, we chose to focus on a
single criterion to show the potential benefits of using ambiguity information in the
decision-making process.


Here, the goal was to minimize the total number of repeat false alarms over the 
entire domain by allowing the user to ineorporate the ambiguity information to alter the 
deeision to protect at any grid point where the previous consequence was a false alarm 
(i.e., an unnecessary protection). In other words, at a specific grid point in the domain, if 
the previous forecast at that location resulted in a false alarm, the user may change the 
current “protect” decision to “do not protect” if the ambiguity distribution indicates that 
the decision input is unclear (i.e., overlap exists). The occurrence of any other 
consequence following the first false alarm breaks the sequence preventing the user from 
reversing decisions. An important aspect of the study is that we must maintain the 
primary value associated with using the best-guess forecast probability to minimize 
expected expense while simultaneously gaining value based on the secondary criteria. 
The metrics used to monitor the primary value were discussed in Chapter III.F.2. Using 
the 2-m temperature data, we focused the study on the event threshold of temperature 
< 0°C, which is critical to a variety of users in the real world. It is easy to imagine the 
issue of repeat false alarms extending to any event of interest. 

In this scenario, we explored several possible decision rules employed by the
forecast user when a repeat false alarm is possible (Table 6).
Decisions made by users who follow these decision rules are evaluated in relation to a
normative user who consistently makes decisions based on the best-guess ensemble
forecast (the ‘Control’ user in the table). The basic decision flow for using the ambiguity
distribution overlap to reduce repeat false alarms is shown in Figure 40. It is important to
emphasize that changes to the decision can only happen following a previous false alarm
at the same grid point. When an opportunity to reverse the decision arises and the
decision is unclear, the overlap is compared to an overlap threshold value to determine if
the user’s action will be changed. If the current overlap is greater than the overlap
threshold value, the user will reverse the decision (i.e., choose not to protect).
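
The decision flow for a single grid point can be sketched as below. This is a simplified reading of the rule, with hypothetical function and argument names: overlaps are expressed as fractions, the threshold is assumed non-negative, and the previous consequence is judged from the action the user actually took.

```python
def apply_reversal_rule(normative_protect, event_occurred, overlap, threshold):
    """Return the user's actions at one grid point over consecutive days.

    normative_protect: bools, the best-guess (control) decision each day
    event_occurred:    bools, whether the event verified each day
    overlap:           floats, ambiguity overlap for each day's decision
    threshold:         overlap threshold above which a protect decision
                       is reversed after a false alarm
    """
    actions = []
    previous_false_alarm = False
    for protect, occurred, ov in zip(normative_protect, event_occurred, overlap):
        action = protect
        if previous_false_alarm and protect and ov > threshold:
            action = False               # reverse: choose not to protect
        actions.append(action)
        previous_false_alarm = action and not occurred
    return actions


def count_repeat_false_alarms(actions, event_occurred):
    """Count false alarms that immediately follow another false alarm."""
    count = 0
    previous_false_alarm = False
    for action, occurred in zip(actions, event_occurred):
        false_alarm = action and not occurred
        if false_alarm and previous_false_alarm:
            count += 1
        previous_false_alarm = false_alarm
    return count
```

Any consequence other than a false alarm resets the flag, so the user cannot reverse a decision unless the immediately preceding outcome at that location was an unnecessary protection.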

The conceptual model, depicted in Figure 41, was our first guess at how the 
overlap threshold value should vary according to C/L. We assumed that high C/L users 
needed to reverse the decision less often (i.e., higher overlap threshold) since they are 
generally not as concerned about false alarms, while users with low C/L may be anxious 
to reverse the decision since false alarms may occur frequently. For a low C/L user, the
forecast probability is more apt to indicate that protective action is required, which is 
likely to result in more negative consequences (i.e., false alarms and repeat false alarms). 
Thus we assumed the low C/L user will want more leeway (i.e., smaller overlap 
threshold) when the option to reverse the decision is available. 

We found the empirical optimal overlap threshold by building contingency tables
for overlap thresholds varying from 50% down to 0.5% in steps of 0.5% for each C/L
while also measuring secondary criteria value. When using one of the practical
estimation methods (e.g., CES) to create the ambiguity distribution, the best-guess
forecast probability value is generally located at the center of the distribution; thus, the
maximum overlap value is taken as 50%. An overlap greater than 50% would necessarily
result in a different initial decision, and the user would not have the option to change (i.e.,
the current decision would be to not protect).

For each C/L, primary value metrics (VS, POD, and POMD) based on the control 
user were compared with the metrics computed using the overlap thresholds with the 
ambiguity information to reverse appropriate decisions. For example, at C/L 1%, the 
control user’s metrics were first compared to metrics found using an overlap threshold of 
50%. If no significant difference (defined in Chapter III.F.2) was found, 50% was stored 
as the optimal overlap threshold. Primary value metrics based on subsequent overlap 
thresholds (49.5% to 0.5% at a -0.5% increment) were also compared to the control user 
until a significant difference was found, at which point the optimal overlap threshold was 
taken as the previously stored value. The comparison process (Figure 42) repeated at 
each C/L resulted in an empirically derived optimal overlap threshold value for each C/L 
representing the lowest overlap threshold value that did not significantly degrade the 
primary value. This optimal overlap threshold has the potential to deliver significant 
changes to and add value for our secondary criteria. 
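
The threshold search for one C/L value can be sketched as a simple descending scan. The function name is hypothetical, and the significance test is passed in as a callable because its details (the three primary metrics and the control's 95% CI) are handled elsewhere. Returning `None` when even the 50% threshold shows a significant difference is an assumption made for this sketch.

```python
def optimal_overlap_threshold(is_significant, start=50.0, stop=0.5, step=0.5):
    """Scan thresholds from start down to stop and keep the last one
    for which no significant change in primary value was found.

    is_significant: callable taking a threshold (in percent) and
                    returning True if any primary metric differs
                    significantly from the control user.
    """
    n_steps = int(round((start - stop) / step)) + 1
    best = None
    for k in range(n_steps):
        threshold = start - k * step
        if is_significant(threshold):
            break                        # stop at first significant change
        best = threshold                 # store the current threshold
    return best
```

For example, if significance first appears just below 31.5%, the scan stores 31.5% and stops at 31.0%.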

4. Processing of Real-World EPS Data 

Using the training dataset of 63 days of 2-m temperature forecasts and
observations, we processed the data using the same procedures described previously for
determining the error characteristics of the L96M forecast data (Chapter III.B.3). The
GEFS EPS error characteristics (Table 7) were used to provide calibration coefficients for
both the training dataset and the independent application dataset. The bulk-calibrated
training dataset gave the random error distributions associated with the GEFS forecasts,
which were used to generate the required CES ambiguity distributions. The variance of
the error distributions was determined using subsets based on forecast days to capture the
likely flow-dependent sensitivities and the EPS’s inability to adequately sample IC and
model errors.

The calibration coefficients (shift = 0.0319°C, stretch = 1.64) derived from the
training dataset indicated that the GEFS forecasts were on average negatively biased and
under-spread. Calibration of the training dataset resulted in an ME of zero and a
fractional error in ensemble spread of 0.976 (increased from 0.596); thus, even with
calibration, the ensemble forecasts were still slightly under-spread. Using the reliability
diagrams for the raw and calibrated training dataset forecasts (Figure 43), we see that the
calibration improved the reliability and the forecasts are now highly reliable. We used
the reliability diagram to compute the reliability and resolution components of the BSS.
The reliability (rel) was improved from 2.04x10^-3 for the raw data to 1.23x10^-3 after
calibration. From the bin usage histograms, it appears calibration marginally decreased
resolution (res) (i.e., more forecasts falling outside bins 1 and 11), but both the raw and
calibrated data had res equal to 0.186. A decrease in resolution was expected since the
spread of the forecasts was increased (i.e., made less sharp) during the calibration
process. Overall, the calibrated forecasts of 2-m temperature at 120-hr were quite skillful
with a BSS of 0.756 (increased from 0.752 for the raw data) when compared to the
sample climatology.

We also calibrated the application dataset using the coefficients given above,
which resulted in an ME of -0.124°C (shifted from -0.156°C) and a fractional error of
1.015 (increased from 0.620). Thus, the independent forecasts were still negatively biased
after calibration, but the ensemble spread was increased slightly too much. Looking at
the reliability diagrams in Figure 44, we see that the calibration performed quite well. In
general, the assumption of stationary (systematic) error characteristics appeared to hold
(i.e., difference in ME between the datasets is less than 0.1 °C and the spread correction
resulted in near perfect fractional error). Because of the stationarity in systematic error,
we assume similar stationarity in the random error, indicating that ambiguity estimates
computed using the training dataset should apply well to the application dataset.
Component rel values for the raw and calibrated data calculated from the reliability
diagrams were 3.72x10^-3 and 1.20x10^-3, respectively, showing the improvement in
reliability. The res components for the raw and calibrated data were 0.169 and 0.167,
respectively, reflecting a slight decrease in resolution. Thus, the calibrated forecasts in
the application dataset are highly reliable and display fairly high resolution compared to
the maximum res of 0.245 possible based on the uncertainty (unc) [Equation (14), page
47] of the forecast associated with the sample climatology. The BSS for the calibrated
dataset was 0.677 (increased from 0.673), which indicates quite skillful forecasts that
should provide value.

From the description of the NCEP EPS, we expect that ambiguity may be high
due to the limited number of ensemble members and its complete lack of model
perturbations, but we have seen that ambiguity also varies with forecast error growth and
the ensemble spread. We determined the stage of forecast error growth for the 5-day
forecasts by comparing the MSE of the control forecast (i.e., the first ensemble member)
to the climatological variance (σ_c²) of 2-m temperature taken over the sample dataset.
Our comparison of σ_c² (154.44) and the control MSE (20.00) resulted in a value of 12.94%,
indicating that the GEFS forecasts were on average still in the early stages of error
growth (i.e., since the control MSE may grow to as large as twice σ_c²). This result
provided more evidence to support our assumption of high ambiguity in the ensemble
forecasts of 2-m temperature at the lead time chosen, as the ensemble variance at this
time would still on average be relatively small (discussed in detail in the Results chapter).

For this value study, we employed ambiguity distributions created using CES
(Chapter III.C.2), which was configured to produce 50,000 Pj values for all forecast
probability values from 0.5% to 99.5% incremented by 0.5%. Ensemble spread values
were binned using a class interval of 0.1 °C over the range of values from 0-11°C to group
forecasts exhibiting similar uncertainty. In application, any ensemble spread value
greater than 11°C used the 11°C bin. The resulting CES ambiguity distribution tables
were provided at a 1% interval from 1%-99%, with a specific table for each combination
of forecast probability and ensemble spread. Altogether, there are 2,231,100 elements
in the tables, which is much too large to show here, but Figure 45 and Table 8 display
some sample data. In Table 8, which shows a sample of three ambiguity distributions for
p* = 15%, we see the distributions narrow for forecasts with larger ensemble spread. For
example, the ambiguity distribution for an ensemble spread of 2 °C ranges from 0% to
above 55%, while the distribution for an ensemble spread of 8 °C ranges from 0% to 40%,
as seen in Figure 45.

We used the training dataset to determine the empirical overlap threshold value
for each C/L (at an increment of 0.01) using the method described above (Figure 46). We
computed the first-order and secondary criteria value metrics based on the “control” user
as well as for each of the possible overlap threshold values (50%-0.5% at -0.5%
increments). For each C/L, we then compared the metrics for each overlap threshold
against the control’s scores to find the optimal overlap threshold (Figure 42). An
example of the comparisons made for C/L 0.01 is shown in Figure 47. Looking at the VS
alone in this example [panel (a)], we would conclude that the optimal overlap threshold
was 21.5%, but this threshold still shows a significant difference for the POD and POMD
metrics [panels (b) and (c)]. The lowest overlap threshold value where no significant
difference exists for all three metrics is 31.5%, which was taken as the optimal overlap
threshold value for C/L 0.01 (Figure 46).

The resampling process was not entirely straightforward for this study. For
metrics such as VS, POD and POMD, we were able to use the standard approach to
resampling, where resample draws may be taken from any grid point on any day. For
example, using the training dataset (63 days x 1620 grid points = 102,060 forecasts),
resampling may be accomplished by placing the forecasts sequentially in a single column
vector and performing 102,060 draws with replacement for each resampled dataset.
For the secondary criteria value metric of repeat false alarms, the sequential resampling
method proved to be inadequate: the score for the original
dataset was an extreme outlier from the resampled datasets. For repeat false alarms, we
found it was necessary to maintain the time series of forecast-observation pairs at each
grid point when performing the resampling. In other words, resampling was performed
using only the 1620 grid points (i.e., only 1620 draws with replacement performed), but
when a location was drawn, its entire time series of forecasts and observations over all of
the forecast days was taken. This process maintained the consistency of repeat false
alarms at a single grid point while allowing the total number of repeat false alarms to
vary over the domain for each resampled dataset. The time-series resampling alleviated
the problem of the control being an outlier for the secondary criteria value metric, but we
found that it underestimated the first-order value metrics. This is most likely because
time-series resampling effectively reduces the variance of the results, since it pulls
large chunks of data with each draw, and the associated uncertainty within each chunk is
not sampled.
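
The two resampling schemes can be contrasted with a small sketch. The data layout (a list holding one time series of forecast-observation pairs per grid point) and the function names are assumptions made for illustration.

```python
import random


def resample_pooled(series_by_point, rng):
    """Standard resampling: pool every forecast-observation pair from
    all grid points and days, then draw that many pairs with replacement."""
    pooled = [pair for series in series_by_point for pair in series]
    return [rng.choice(pooled) for _ in range(len(pooled))]


def resample_time_series(series_by_point, rng):
    """Time-series resampling: draw grid points with replacement and
    take each drawn point's entire time series, preserving day order."""
    n_points = len(series_by_point)
    return [series_by_point[rng.randrange(n_points)] for _ in range(n_points)]
```

The pooled version breaks the day-to-day ordering at each location, which is why it cannot be used to count consecutive false alarms. The time-series version keeps that ordering intact at the cost of drawing far fewer independent units per resampled dataset.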

The optimal overlap thresholds shown in Figure 46 are close to the reverse of our
original conceptual model, indicating that low C/L users require a higher overlap
threshold than mid to high C/L users. As the C/L increases into and beyond the mid-range
values, the certainty of the forecast required to take protective action increases,
which decreases the likelihood of false alarms and repeat false alarms. For these users,
the size of the overlap was less important, as any overlap threshold used resulted in
minimal and insignificant changes to the primary value because the difference in expense
between a false alarm and a miss is small (C ≈ L). Thus, our algorithm resulted in
smaller values for the optimal threshold. The low C/L users required a larger overlap
threshold as a consequence of the large number of opportunities to change, since
changing too often with a small overlap threshold likely resulted in an increase in misses,
which significantly degraded the scores based on the primary value metrics.

Once the optimal overlap was determined, we then used the application dataset to 
compute the primary and secondary criteria value metrics using all of the decision rule 
types described in Table 6. The optimal and conceptual model overlap thresholds were 
applied to the forecasts using ambiguity distributions based on the CES^ technique as 
before. Results from each of these decision rules were then compared to the control user 
to find any improvement in the secondary criteria while not significantly altering the 
primary value, and these comparisons are reported in the Results chapter. 

Figure 6. Lorenz 96 System schematic with 8 resolved variables (large circles) and 256 
unresolved variables (small circles). The unresolved variables are grouped with 
the resolved variable to which they belong in sets of 32 [From Wilks 2005]. 

Figure 7. Scatterplot of the unresolved tendency U from all resolved variables as a function 
of the resolved variable. The fourth-order polynomial regression best-fit (solid 
line) is the deterministic portion of the parameterization. The average variance of 
U across all X values about the best-fit line is used for the stochastic portion of the 
parameterization. 

Figure 8. Probability density of the resolved (X) variable using (a) L96 System, (b) L96 
Model with deterministic parameterization, and (c) L96 Model with stochastic 
parameterization. 

Figure 9. Multi-model EPS deterministic parameterizations. The solid line is the 
deterministic portion of the stochastic parameterization shown in Figure 7. 
Dashed lines are static deterministic parameterizations, where each is associated 
with a specific ensemble member. Only ten members are shown for clarity. 

Figure 10. Error variance diagram using L96M deterministic and ensemble forecast data 
from 24,000 forecast-observation pairs. 

Figure 11. Dispersion diagram using uncalibrated L96M EPS forecast data from 24,000 
forecast-observation pairs. 

Figure 12. Dispersion diagram using calibrated L96M EPS forecast data from 24,000 
forecast-observation pairs. 

[Figure 13 panels: Analysis; τ = 2. Horizontal axis: Verification Rank.] 


Figure 13. Verification rank histograms using uncalibrated L96 EPS ensemble forecast data 
from 24,000 forecast-observation pairs for various forecast lead times. The solid 
red line indicates the uniform probability of any rank given a 21-member 
ensemble. The dashed red lines are the bounds of the 95% CI about the uniform 
probability given the number of ensemble forecasts (M). 
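
The reference lines described in this caption can be reproduced with a short calculation. The sketch below is illustrative only: the caption does not state how the 95% CI was computed, so a normal approximation to the binomial distribution is assumed here, and the function names are hypothetical.

```python
import math
import numpy as np

def rank_histogram(ensemble, verification):
    """Relative frequency of the verification rank.

    ensemble: array (M, n_members); verification: array (M,).
    Rank 0 means the verification lies below every member.
    """
    n_members = ensemble.shape[1]
    ranks = np.sum(ensemble < verification[:, None], axis=1)
    counts = np.bincount(ranks, minlength=n_members + 1)
    return counts / counts.sum()

def uniform_rank_ci(n_members, n_forecasts, z=1.96):
    """Uniform rank probability and an assumed normal-approximation 95% CI."""
    p = 1.0 / (n_members + 1)
    half_width = z * math.sqrt(p * (1.0 - p) / n_forecasts)
    return p, p - half_width, p + half_width

# 21 members give 22 possible ranks; M = 24,000 forecasts as in the caption.
p_uniform, lower, upper = uniform_rank_ci(21, 24000)
print(p_uniform, lower, upper)
```

A rank whose relative frequency falls outside the two bounds departs from uniformity by more than sampling noise alone would explain, under the stated assumption.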

Figure 14. Verification rank histograms using calibrated L96 EPS ensemble forecast data 
from 24,000 forecast-observation pairs for various forecast lead times. Same as 
Figure 13. 

Figure 15. Comparison of Verification Outlier Percentage (VOP) values based on the 
uncalibrated (solid) and calibrated (dot-dash) L96 EPS ensemble forecast data 
from 24,000 forecast-observation pairs. The perfect VOP-line of 0.26% is shown 
by the dotted line. 



Figure 16. Brier skill score (BSS) for the common event using uncalibrated L96 EPS 
ensemble forecast data from 24,000 forecast-observation pairs. Error bars created 
using bootstrap resampling represent the 95% CI about the BSS value at each 
forecast lead time. The dashed line is the zero-skill line. 
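
For readers who want to see how a BSS value and its bootstrap error bars can be obtained, a minimal sketch follows. It assumes the reference forecast is the sample climatology and that the CI is taken from the 2.5th and 97.5th percentiles of the resampled scores; neither detail is stated in this caption, and the function names are hypothetical.

```python
import numpy as np

def brier_skill_score(prob, obs):
    """BSS relative to sample climatology (assumed reference forecast)."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    brier = np.mean((prob - obs) ** 2)
    brier_ref = np.mean((obs.mean() - obs) ** 2)
    return 1.0 - brier / brier_ref

def bootstrap_bss_ci(prob, obs, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI from resampled forecast-observation pairs."""
    rng = np.random.default_rng(seed)
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    n = len(prob)
    scores = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # draw pairs with replacement
        scores[b] = brier_skill_score(prob[idx], obs[idx])
    return np.quantile(scores, [alpha / 2.0, 1.0 - alpha / 2.0])

# Synthetic example: observations drawn consistently with the forecast probabilities.
rng = np.random.default_rng(1)
prob = rng.uniform(0.0, 1.0, size=2000)
obs = (rng.uniform(0.0, 1.0, size=2000) < prob).astype(float)
print(brier_skill_score(prob, obs), bootstrap_bss_ci(prob, obs))
```

A BSS of zero corresponds to the zero-skill line: the forecast does no better than always issuing the climatological frequency.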

Figure 17. BSS for the common event using calibrated L96 EPS ensemble forecast data from 
24,000 forecast-observation pairs. Same as Figure 16. 

Figure 18. Comparison of (a) reliability and (b) resolution components of BSS for both 
uncalibrated (blue solid line) and calibrated (red dashed line) for the common 
event. 

Figure 19. BSS for the rare event using uncalibrated L96 EPS ensemble forecast data from 
24,000 forecast-observation pairs. Same as Figure 16. 

Figure 20. BSS for the rare event using calibrated L96 EPS ensemble forecast data from 
24,000 forecast-observation pairs. Same as Figure 16. 

Figure 21. Comparison of (a) reliability and (b) resolution components of BSS for both 
uncalibrated (blue solid line) and calibrated (red dashed line) for the rare event. 

Figure 22. Uniform Ranks method. Calculating forecast probability for X > 5.0 using a 
10-member ensemble. The probability value of 77% is represented by the hatched 
area [After Szczes 2008]. 

Figure 23. Postprocessing steps for L96 EPS Data. 

Figure 24. L96M EPS EoE Schematic. After the random starting state is determined, this 
state is integrated forward through the data assimilation and forecast periods using 
the L96S. The process inside the dashed box is repeated N times using the L96M 
with the same random initial state to generate the EoE constituents. 

Figure 25. Example comparison of a true and an ensemble forecast PDF (a) and CDF (b) 
defined as N(2.2°C, 2.6°C) and N(2.8°C, 1.8°C) respectively. An error of -13.9% 
in p_e for the chance of temperature < 0°C is the difference in the PDFs’ shaded 
areas, or the difference in the two CDFs (double arrow) [From Eckel and Allen 
2009]. 

Figure 26. (a) Error in p_e for a range of temperature values for the event threshold, calculated 
as the difference in the two CDFs of Figure 25. The top axis is the nonlinear 
scale. (b) Plot of p_e vs. true forecast probability (solid), where the dashed line 
indicates perfect correlation [From Eckel and Allen 2009]. 

Figure 27. Histogram and fitted PDFs of results from an example bulk-calibrated ensemble 
forecast dataset for (a) error in ensemble mean, (b) fractional error in ensemble 
spread, and (c) ensemble spread. The data are based on statistics from the JM 
51-member EPS. The domain and forecast period are the same as described in 
Chapter III.F. [From Eckel and Allen 2009]. 


Figure 28. Scatter plots showing relationships between the variables in Figure 27. 
Correlation coefficient (r) is inset in each plot [From Eckel and Allen 2009]. 

Figure 29. Relationship of ensemble spread with variability (standard deviation) of (a) 
ensemble mean error and (b) fractional error in ensemble spread. Solid line in 
each plot indicates the standard deviation of the error distributions in Figure 27 
(a) and (b). [After Eckel and Allen 2009]. 

Figure 30. True forecast probability for five sets of random draws from the PDFs in Figure 
27 where each curve is labeled with its associated ensemble mean error, ensemble 
spread error and ensemble spread. The five possible values of true forecast 
probability (marked by dots) for a p_e of 55% are 79.1, 69.6, 52.4, 51.3, and 46.7% 
[After Eckel and Allen 2009]. 

Figure 31. Histogram of 50,000 sample values of true forecast probability for calibrated 
ensemble forecast probability of (a) 55.0%, (b) 11.0%, and (c) 94.0% generated 
from random samples from the PDFs in Figure 27. Each histogram is centered on 
the p_e value from which it was generated since the ensemble forecast PDFs were 
calibrated. The 5th and 95th percentile values of true forecast probability (for use 
in Figure 32) are indicated by p_5 and p_95 [From Eckel and Allen 2009]. 

Figure 32. CES ambiguity for all calibrated forecast probability values. After repeated 
sampling, the 5th and the 95th percentiles of the possible true forecast probability 
values (p_5 and p_95) represent ambiguity as a 90% CI about the expected true value 
(dashed line) for calibrated p_e [After Eckel and Allen 2009]. 

Figure 33. CES ambiguity for all calibrated forecast probability values using a set ensemble 
spread. Similar to Figure 32 but for specific values of ensemble spread rather 
than all possible values, but still based on the error distributions in Figure 27 (a) 
and (b). The thin (thick) curves show the ambiguity for an ensemble spread of 
2.0°C (6.0°C) [From Eckel and Allen 2009]. 

Figure 34. Ambiguity distributions produced by bootstrap resampling of simulated ensemble 
forecast data (not shown) for (a) an example perfect 30-member forecast, 
simulated by 30 random draws from the true PDF in Figure 25 and (b) an 
example perfect 80-member forecast simulated using the same true PDF as in (a). 
The original forecast probability (p_e), p_5 and p_95 (5th and 95th percentiles that 
define total ambiguity), and p_T (true forecast probability) are labeled. Total 
ambiguity values are 17.8% for (a) and 12.4% for (b). Notice that p_e ends up as 
the distribution’s central value [After Eckel and Allen 2009]. 

Figure 35. Error distributions of (a) mean error in the ensemble mean and (b) fractional error 
in ensemble spread. The solid lines are the original, uncalibrated error 
distributions for the JM 2-m 5-day temperature forecasts. The dashed lines give 
the reduced error distributions, where the error variance associated with finite 
sampling (for 51 members) has been removed. The reduced error distributions 
are used to draw random calibration coefficients during RCR. 

Figure 36. Example RCR ambiguity distributions using (a) fixed, bulk calibration on each 
resample and (b) random calibration on each resample for the JM 5-day 2-m 
temperature forecast for a single grid point and date. Note that the random 
calibration produces a wider ambiguity distribution [After Eckel and Allen 2009]. 

Generate single set of 100 L96M EoE constituents 

L96M EPS Error Statistics 

For each X variable at each tau: 

1. Choose test forecast probability (1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 
   90%, 95%, 99%) 
2. Use iterative bisection method (Section III.D.1) to determine distribution of 
   forecast probability values with expected value equal to chosen test value 
3. Store the 100 constituent forecast probability values 
4. Repeat the process for all test forecast probability values 

Output: 100 possible true forecast probability values per variable per test 
probability per tau 

RCR 

For each X variable at each tau: 

1. Choose test forecast probability (1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 
   90%, 95%, 99%) 
2. Take the first constituent from an EoE set as the control ensemble forecast 
3. Perform RCR using L96M EPS error statistics for random calibration coefficients 
4. Use iterative bisection method to determine distribution of forecast probability 
   values with expected value equal to chosen test value 
5. Store the 10,000 resampled forecast probability values 
6. Repeat the process for all test forecast probability values 

Output: 10,000 possible true forecast probability values per variable per test 
probability per tau 

Figure 37. Post-processing steps for ambiguity data for the three estimation techniques. 

Bisection Algorithm: 

1. Using the control ensemble forecast, find the max and min value of the ensemble 
   members. Add -/+ ΔX to the min/max values to find bi-val1 and bi-val2. (a) 
   Expanding the ensemble range helps the algorithm converge on extreme 
   probability values. 

2. Find the mean of bi-val1 and bi-val2: bi-val3. (a) 

3. For all EoE constituent forecasts, find P(X > bi-val3), and find the expected 
   value of the constituents’ probabilities (E[p_T]). (b) 

4. Compare E[p_T] to the desired probability value (p*): 

   error = p* - E[p_T] 

5. If |error| > ε = 0.0001 and: 

   error > 0, set bi-val2 = bi-val3 
   - or - 
   error < 0, set bi-val1 = bi-val3 (c) 

   Return to step 2 and repeat process using the new bi-val values. (c) 

6. If |error| < ε = 0.0001, the algorithm has converged. 


Figure 38. Iterative-bisection method used to converge on the Z-value giving the expected 
value of EoE constituent or RCR resampled values equal to some desired p* 

value. 
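
The steps listed for Figure 38 translate into a short routine. The sketch below is a minimal illustration under stated assumptions: each constituent's probability is taken from a normal fit to its members, the range expansion ΔX is given an arbitrary default of 5.0, and the function and argument names are hypothetical rather than taken from the research code.

```python
import numpy as np
from statistics import NormalDist

def exceedance_probs(constituents, threshold):
    """P(X > threshold) per constituent, from a normal fit to its members (assumed)."""
    probs = []
    for members in constituents:
        fit = NormalDist(float(np.mean(members)), float(np.std(members, ddof=1)))
        probs.append(1.0 - fit.cdf(threshold))
    return np.array(probs)

def bisect_threshold(constituents, control, p_star, delta_x=5.0, eps=1e-4, max_iter=200):
    """Find the threshold whose expected constituent probability equals p_star."""
    lower = float(np.min(control)) - delta_x  # bi-val1
    upper = float(np.max(control)) + delta_x  # bi-val2
    for _ in range(max_iter):
        middle = 0.5 * (lower + upper)        # bi-val3
        probs = exceedance_probs(constituents, middle)
        error = p_star - probs.mean()
        if abs(error) < eps:
            return middle, probs
        if error > 0:
            upper = middle  # probability too small: move the threshold down
        else:
            lower = middle  # probability too large: move the threshold up
    raise RuntimeError("bisection did not converge")

# Synthetic stand-in for 100 constituents of 21 members each.
rng = np.random.default_rng(0)
centers = rng.normal(0.0, 0.5, size=100)
constituents = [rng.normal(c, 1.0, size=21) for c in centers]
threshold, probs = bisect_threshold(constituents, constituents[0], p_star=0.5)
print(threshold, probs.mean())
```

The returned array of probabilities is the sample from which an ambiguity distribution for the chosen p* can then be built.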

Figure 39. Integrated optimal VS (IOVS) example for the control forecasts at a single forecast 
lead time. (a) The optimal VS is computed using the 800 control forecast 
probability values at τ = 2.6. The positive area under the curve is computed 
using Equation (27) by summing the area of intervals (gray regions) from C/L 0-1 
using a Δx of 0.01. (b) The Δy of each interval’s area is the optimal VS at the 
center of the interval (e.g., for the interval 0.51-0.52, Δy is the optimal VS at C/L 
= 0.515). An interval’s area is taken as zero if the optimal VS < 0. 
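
The interval summation described in this caption can be sketched as below. This follows only the caption's description (interval width 0.01, height taken at the interval center, negative heights contributing nothing); Equation (27) itself is not reproduced here, and the function name and the toy VS curve are assumptions for illustration.

```python
import numpy as np

def integrated_optimal_vs(optimal_vs, dx=0.01):
    """Sum of positive interval areas of an optimal-VS curve over C/L from 0 to 1.

    optimal_vs: callable returning the optimal VS at a given C/L (hypothetical).
    """
    n_intervals = int(round(1.0 / dx))
    centers = (np.arange(n_intervals) + 0.5) * dx       # 0.005, 0.015, ..., 0.995
    heights = np.array([optimal_vs(c) for c in centers], dtype=float)
    areas = np.where(heights > 0.0, heights, 0.0) * dx  # negative VS contributes zero
    return float(areas.sum())

# Toy curve that is positive below C/L = 0.5 and negative above it.
print(integrated_optimal_vs(lambda cl: 1.0 - 2.0 * cl))
```

Evaluating the height at the interval center makes this a midpoint rule, so for the linear toy curve the sum equals the exact positive area.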









Figure 40. Flowchart of decision process for the repeat false alarms secondary criteria 
scenario using the ambiguity distribution overlap. Tallying indicates filling in the 
contingency table (Table 2, page 31) for the current decision rule (C/L). The 
setting of the repeat false alarm flag determines the outcome of the “Previous 
forecast FA” decision point, where a set flag equals Y. 

Figure 41. Overlap threshold conceptual model as a function of C/L for the repeat false alarm 
secondary criteria value testing scenario. 

Figure 42. Flowchart for determining empirical secondary criteria overlap threshold value. 
Performed for each C/L, testing compares the metrics derived using the control 
forecast probability versus using the current overlap threshold. 

Figure 43. Reliability diagrams for raw and calibrated NCEP GEFS forecasts based on the 
training dataset with 102,060 forecast-observation pairs. The reliability diagrams 
for the (a) raw and (c) calibrated data used 11 forecast probability bins (0-0.05, 
0.05-0.15, 0.15-0.25,..., 0.95-1.0) where the average forecast probability within 
each bin is used as the bin’s representative value. Error bars represent the 95% 
binomial CI (Wilks, 2006). The dashed line indicates perfect reliability, while the 
dotted line shows the sample climatology. The bin usage histograms for the (b) 
raw and (d) calibrated data give the number of forecast probabilities falling in 
each of the 11 bins. 
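
The binning used for these reliability diagrams can be sketched as follows. The bin edges come from the caption; how a probability lying exactly on an edge is assigned is not stated, so assigning it to the upper bin is an assumption here, as are the function and variable names.

```python
import numpy as np

# Interior bin edges 0.05, 0.15, ..., 0.95 give the 11 bins listed in the caption.
INTERIOR_EDGES = np.linspace(0.05, 0.95, 10)

def reliability_table(prob, obs):
    """Return (bin index, count, mean forecast probability, observed frequency) per bin."""
    prob = np.asarray(prob, dtype=float)
    obs = np.asarray(obs, dtype=float)
    bin_index = np.digitize(prob, INTERIOR_EDGES)  # values on an edge go to the upper bin
    rows = []
    for b in range(len(INTERIOR_EDGES) + 1):
        in_bin = bin_index == b
        count = int(in_bin.sum())
        if count == 0:
            rows.append((b, 0, float("nan"), float("nan")))
        else:
            rows.append((b, count, float(prob[in_bin].mean()), float(obs[in_bin].mean())))
    return rows

for row in reliability_table([0.0, 0.04, 0.05, 0.5, 1.0], [0, 0, 1, 1, 1]):
    print(row)
```

The count column corresponds to the bin usage histograms, and plotting observed frequency against mean forecast probability gives the reliability curve.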

Figure 44. Reliability diagram for raw and calibrated NCEP GEFS forecasts based on the 
independent application dataset with 50,220 forecast-observation pairs. Same as 
Figure 43. 

Figure 45. Sample CES^ NCEP GEFS 21-member EPS ambiguity distributions created 
using error statistics in Table 7. The histograms show the relative frequency of 
p_T values for p* = 15% with σ_e = 2 °C (gray) and σ_e = 8 °C (transparent). 

Figure 46. Empirical optimal overlap threshold for reducing repeat false alarms for the event 
2-m temperature < 0°C using the NCEP GEFS training dataset. The optimal 
overlap threshold is computed at each C/L from 0.01-0.99 at an increment of 0.01 
(solid line). 

Figure 47. Comparison of primary value metrics (a) optimal VS, (b) POD and (c) POMD 
used to find the optimal overlap threshold for C/L 0.01. Control scores in all three 
panels are shown by the solid line with error bars representing the 95% CI. The 
expected value of metrics using overlap threshold values from 0.5% to 50% at a 
0.5% increment are shown by the dot-dashed line with a circle at each overlap 
threshold value. Arrows indicate the first point where expected value of each 
metric falls within the 95% CI of the control. The optimal overlap threshold is the 
lowest threshold value where the expected values of all three metrics fall within 
the 95% CI of the control. In this case, the optimal overlap threshold is 31.5%. 
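
The selection rule stated in this caption can be written as a simple search. The sketch below uses invented placeholder numbers purely to exercise the rule; the data structures and names are assumptions, and the real metric values come from the resampling procedure described in the text.

```python
def optimal_overlap_threshold(thresholds, expected_metrics, control_ci):
    """Lowest threshold at which every metric lies inside the control's 95% CI.

    thresholds: ascending list of overlap thresholds (%).
    expected_metrics: dict metric name -> list of expected values, one per threshold.
    control_ci: dict metric name -> (lower, upper) bounds of the control's CI.
    """
    for i, threshold in enumerate(thresholds):
        inside = all(
            control_ci[name][0] <= values[i] <= control_ci[name][1]
            for name, values in expected_metrics.items()
        )
        if inside:
            return threshold
    return None  # no tested threshold satisfies all metrics

# Placeholder data: thresholds 0.5% to 50% in 0.5% steps, metrics that enter the
# control CI at different points (values are illustrative, not results).
thresholds = [0.5 * k for k in range(1, 101)]
expected_metrics = {
    "optimal_vs": [1.0 if t >= 31.5 else 0.0 for t in thresholds],
    "pod": [1.0 if t >= 20.0 else 0.0 for t in thresholds],
    "pomd": [1.0 if t >= 10.0 else 0.0 for t in thresholds],
}
control_ci = {"optimal_vs": (0.9, 1.1), "pod": (0.9, 1.1), "pomd": (0.9, 1.1)}
print(optimal_overlap_threshold(thresholds, expected_metrics, control_ci))
```

The threshold is set by whichever metric is slowest to return to the control's CI, which is why the three arrows in the figure can sit at different threshold values.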

Table 4. Climatological data for L96 System and Model. The 95% CI about the expected 
value for each statistic is taken as ± the values in parentheses. 

Statistic   System           Model (Det.)     Model (Stoch.)
X_max       14.76 (0.003)    14.05 (0.004)    14.78 (0.014)
X_min       -8.31 (0.011)    -7.01 (0.012)    -7.99 (0.011)
μ_X         3.69 (0.004)     3.76 (0.004)     3.73 (0.004)
σ_X         4.54 (0.002)     4.42 (0.002)     4.43 (0.002)
Y_max       2.46 (0.003)     -                -
Y_min       -1.8 (0.007)     -                -
μ_Y         0.12 (0.001)     -                -
σ_Y         0.30 (0.001)     -                -

Table 5. L96M EPS Error Statistics (bulk and variance) at each forecast lead time. 

       ME_ē                Ensemble variance    MSE_ē                σ'
tau    Bulk     Variance   Bulk      Variance   Bulk      Variance   Bulk     Variance
0               0.0036     0.0335    2.0E-05    0.0343    0.0004     0.9776   0.0045
0.2    0.0326   0.0122     0.1343    0.0010     0.1405    0.0104     0.9558   0.0119
0.4    0.0476   0.0274     0.3523    0.0228     0.3931    0.1391     0.8963   0.0324
0.6    0.0182   0.0559     0.7412    0.2273     0.9257    0.9761     0.8007   0.0542
0.8    0.0116   0.0962     1.3365    0.9026     1.7989    3.3177     0.7430   0.0588
1      0.0321   0.1532     2.0735    2.1645     2.9757    7.9045     0.6968   0.0545
1.2    0.0265   0.1957     2.8735    3.8459     4.3806    15.8611    0.6560   0.0459
1.4    0.0237   0.2388     3.6443    4.7197     5.4344    19.8098    0.6706   0.0379
1.6    0.0195   0.2644     4.3504    5.2949     6.4032    25.1928    0.6794   0.0314
1.8    0.0314   0.2855     4.9484    5.3256     7.2733    28.7206    0.6804   0.0249
2      0.0406   0.2944     5.4895    5.1879     8.0469    34.6424    0.6822   0.0200
2.2    0.0513   0.3064     6.0126    4.9067     8.7885    40.7473    0.6841   0.0159
2.4    0.0366   0.3167     6.4537    4.7358     9.3284    44.7655    0.6918   0.0136
2.6    0.0416   0.3128     6.8608    4.5559     9.8513    47.1874    0.6964   0.0115
2.8    0.0400   0.3197     7.2593    4.8418     10.5452   49.3660    0.6884   0.0105
3      0.0573   0.3164     7.6156    4.5110     10.9219   53.1869    0.6973   0.0091
3.2    0.0570   0.3272     7.9683    4.8239     11.4755   57.6767    0.6944   0.0089
3.4    0.0503   0.3319     8.2575    4.7032     11.6946   58.7195    0.7061   0.0084
3.6    0.0422   0.3167     8.5592    4.8423     12.1858   62.2317    0.7024   0.0080
3.8    0.0434   0.3189     8.8287    5.1559     12.6769   64.4172    0.6964   0.0078
4      0.0555   0.3177     9.0636    5.0917     12.9123   65.4157    0.7019   0.0074
4.2    0.0627   0.3266     9.2992    5.2461     13.3640   70.6218    0.6958   0.0071
4.4    0.0502   0.3306     9.5024    5.2769     13.4321   73.4724    0.7074   0.0071
4.6    0.0412   0.3218     9.7396    5.4897     13.8778   75.5714    0.7018   0.0069
4.8    0.0404   0.3116     9.9273    5.6479     14.3054   79.2468    0.6940   0.0067
5      0.0570   0.3210     10.1377   5.7393     14.5251   79.9435    0.6979   0.0066

Table 6. Decision rules tested for secondary criteria value. With the exception of the 
“Control,” these decision rules are only applicable following a forecast false 
alarm when the current decision input advises taking protective action. 

Name: Decision Rule 

Control: User always follows the decision based on the control forecast probability, 
regardless of the consequences from the weather forecast. 

Always: User will always reverse the decision. 

Random: User loses confidence and decides randomly (fair coin toss) whether or 
not to reverse the decision. 

Brash: User understands the concept of ambiguity. Instead of using an objective 
method to apply ambiguity to the forecast, the user applies their own 
‘rough estimate’ of the ambiguity to the control forecast probability to 
avoid repeat false alarms. The ‘rough estimate’ used here is 5%, thus the 
user reverses the decision when p' - 5% < C/L. 

Overlap Conceptual Model: User employs the estimated ambiguity distribution to determine the 
overlap. The overlap value is compared to an overlap threshold 
determined from the conceptual model (discussed in the text). Overlap 
values greater than the threshold result in the user reversing the decision. 

Optimal Overlap: User employs the full estimated ambiguity distribution to determine the 
overlap. The overlap value is compared to an empirically determined 
overlap threshold (discussed in the text). Overlap values greater than the 
threshold result in the user reversing the decision. 
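
The decision rules in Table 6 can be expressed compactly in code. The sketch below is hypothetical throughout: it assumes the normative rule is to protect when the forecast probability exceeds C/L, and it assumes the overlap is the fraction of the ambiguity distribution lying below C/L. Neither definition is given in this table, and the function and argument names are invented for illustration.

```python
import random

def decide(rule, p_control, cost_loss, prev_false_alarm,
           ambiguity_samples=None, overlap_threshold=None, rng=random):
    """Return True if the user takes protective action under the given rule."""
    protect = p_control > cost_loss  # assumed normative decision
    # Reversal rules apply only after a false alarm when protection is advised.
    if rule == "control" or not (protect and prev_false_alarm):
        return protect
    if rule == "always":
        return False
    if rule == "random":
        return rng.random() < 0.5  # fair coin toss on whether to keep the decision
    if rule == "brash":
        return not (p_control - 0.05 < cost_loss)  # 5% rough ambiguity estimate
    if rule == "overlap":
        below = sum(1 for p in ambiguity_samples if p < cost_loss)
        overlap = below / len(ambiguity_samples)  # assumed definition of overlap
        return not (overlap > overlap_threshold)
    raise ValueError("unknown rule: " + rule)

samples = [0.10, 0.15, 0.25, 0.30]
print(decide("brash", 0.23, 0.20, True))
print(decide("overlap", 0.23, 0.20, True, samples, overlap_threshold=0.4))
```

The two overlap rules in the table differ only in where the threshold comes from, so a single "overlap" branch with a supplied threshold covers both.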


Table 7. NCEP GEFS 21-member EPS error statistics used to determine calibration 
coefficients and CESl ambiguity distributions. 

                          ME_ē                  σ'                    σ_e
Dataset                   Average   Std. Dev.   Average   Std. Dev.   Average   Std. Dev.
Training, raw             -0.0319   0.767       0.596     0.139       1.78      0.392
Training, calibrated      0.0       0.767       1.0       0.228       2.92      0.641
Application, raw          -0.156    0.915       0.620     0.175       2.00      0.569
Application, calibrated   -0.124    0.915       1.015     0.287       3.27      0.931

Table 8. Partial ambiguity distributions from the NCEP GEFS 21-member EPS CES^ 
tables for p* = 15% using three different ensemble spread values. The table 
contains the relative frequency of sample p_T values within a 1% bin from 0% to 
55%, where the upper bound of each bin is provided. 

Bin Maximum   σ_e = 2 °C   σ_e = 4 °C   σ_e = 8 °C
0.01          0.0099       0.0007       0.0001
0.02          0.0162       0.0032       0.0009
0.03          0.0210       0.0069       0.0024
0.04          0.0247       0.0112       0.0050
0.05          0.0274       0.0163       0.0101
0.06          0.0305       0.0233       0.0150
0.07          0.0330       0.0255       0.0221
0.08          0.0346       0.0326       0.0289
0.09          0.0354       0.0370       0.0373
0.1           0.0374       0.0425       0.0432
0.11          0.0379       0.0467       0.0502
0.12          0.0381       0.0492       0.0571
0.13          0.0413       0.0535       0.0630
0.14          0.0382       0.0540       0.0646
0.15          0.0373       0.0556       0.0675
0.16          0.0376       0.0539       0.0659
0.17          0.0376       0.0530       0.0629
0.18          0.0364       0.0533       0.0623
0.19          0.0340       0.0507       0.0566
0.2           0.0333       0.0481       0.0532
0.21          0.0322       0.0433       0.0471
0.22          0.0295       0.0401       0.0401
0.23          0.0282       0.0355       0.0340
0.24          0.0269       0.0319       0.0282
0.25          0.0248       0.0278       0.0222
0.26          0.0231       0.0232       0.0180
0.27          0.0216       0.0192       0.0132
0.28          0.0202       0.0161       0.0094
0.29          0.0194       0.0126       0.0072
0.3           0.0167       0.0097       0.0044
0.31          0.0144       0.0070       0.0032
0.32          0.0133       0.0054       0.0021
0.33          0.0118       0.0032       0.0011
0.34          0.0107       0.0028       0.0008
0.35          0.0092       0.0019       0.0005
0.36          0.0083       0.0012       0.0001
0.37          0.0078       0.0007       0.0001
0.38          0.0065       0.0005       0.00004
0.39          0.0054       0.0003       0
0.4           0.0044       0.0002       0.00002
0.41          0.0039       0.00004      0
0.42          0.0037       0.00004      0
0.43          0.0028       0            0
0.44          0.0023       0.00002      0
0.45          0.0019       0.00002      0
0.46          0.0014       0            0
0.47          0.0017       0            0
0.48          0.0012       0            0
0.49          0.0011       0            0
0.5           0.0006       0            0
0.51          0.0005       0            0
0.52          0.0007       0            0
0.53          0.0003       0            0
0.54          0.0003       0            0
0.55          0.0002       0            0

IV. RESULTS 


This chapter presents the results obtained during this research regarding the three 
research objectives outlined in the Introduction. 

A. EVOLUTION OF AMBIGUITY 

This section addresses the first research goal of understanding the behavior of 
ambiguity throughout the forecast. This goal is accomplished using the EoE. The EoE is 
our best estimate of ambiguity since it directly samples the inherent uncertainties in the 
IC and model perturbations and their sensitivities to a specific forecast scenario. The 
constituents’ IC and model perturbations span the range of analysis and model errors, 
giving a distribution of plausible probability forecasts for a given EPS’s sensitivity to 
errors in the ICs and in the model. The evolution studies were performed using 100 EoE 
forecast cases selected to span the L96M attractor, each with 100 constituents. 

Our original hypothesis concerning the behavior of ambiguity regarded the 
magnitude of the variance of the random errors in the first two moments of the ensemble 
PDF as the primary influences on ambiguity. Specifically, the mean error of the 
ensemble mean (ME_ē) and the fractional error in ensemble spread (σ') were considered. 
We hypothesized that increases (decreases) in ambiguity are directly related to the 
increases (decreases) in the variance of the random errors. Errors in the first moment 
play a larger role in creating errors in forecast probability, thus the variance of the ME_ē 
dominates. 

Using the large dataset of 24,000 ensemble forecasts from the L96M EPS, the 
variances in ME_ē and σ' were diagnosed following bulk calibration of the data to remove 
systematic error. The evolution of the variance in these two error characteristics is shown 
in Figure 48 (a) and (b). Early in the forecast (before τ = 0.6), error variance is low as all 
ensemble members likely exhibit similar high skill. Maximum dispersion in the 
ensemble forecasts occurs on average between τ = 0.6 and τ = 2.0 (Chapter III.A.4). 
During this period, the variance in the ME_ē error distribution increases. As ensemble 
spread increases, the possible error in the ensemble mean increases since the verification 
may fall farther from the center of the forecast PDF, thus the variance in the ME_ē also 
increases. The variance in the fractional error in ensemble spread increases as dispersion 
ramps up, but quickly peaks and begins a gradual decrease to and below its original level. 
At early lead times, the skill of all ensemble members is likely high, resulting in 
consistently low ensemble spread and similar fractional error values between forecasts. 
As error growth begins to ramp up, some members experience faster error growth than 
others due to sensitivities to the location in the attractor or deficiencies in the EPS. At 
this point, fractional error between constituents can vary greatly, and it ascends to its 
maximum variance. Eventually, high error growth occurs in all members resulting in 
similarly high ensemble spread for all constituents, which reduces the variation in 
fractional error values. Following maximum dispersion, the variance in ME_ē levels off 
but remains high due to the large spread in the ensemble PDF. The variance in fractional 
error continues to decrease and asymptotes towards zero as the ensemble spread similarly 
saturates among all constituents. 

Employing the initial hypothesis regarding the behavior of ambiguity, we 
expected the following evolution. Early in the forecast prior to maximum ensemble 
dispersion, ambiguity should be relatively low since both error variances are low. As the 
forecast moves into the time of maximum dispersion, ambiguity should rapidly increase 
to a maximum following the increase in variance of both errors. However, following 
maximum dispersion, ambiguity was expected to decrease and asymptote to zero as the 
forecast PDF saturates towards climatology, resulting in no uncertainty in the PDF. 

Using the 100 EoE forecast cases, we determined the average total ambiguity 
[Equation (26), page 61] as a function of forecast lead time. Figure 49 shows the average 
total ambiguity for the EoE ambiguity distributions for p* values of 5%, 50%, and 95%, 
for comparison. The behavior of ambiguity shown does not follow our initial hypothesis. 
Rather than peaking during the height of error growth, ambiguity maximized early in the 
forecast period then decreased quickly during peak error growth. Late lead time 
ambiguity did behave as hypothesized except it asymptoted to a minimum value and not 
zero. Evidently, the initial hypothesis is in need of revision. 

To explore this behavior in detail, it was beneficial to look at the evolution of 
ambiguity for a single EoE forecast case. We plotted each of the 100 constituents’ 
forecast PDFs for an arbitrary variable using a normal fit to the 21 members in each 
constituent’s ensemble forecast for forecast lead times τ = 0.2 through τ = 5.0 at an 
interval of 0.2 time units (Appendix). Each figure also displays a histogram of the EoE 
p_T values at each lead time, where the expected value of each distribution is 50%. These 
figures display a time sequence where ambiguity starts out high and decreases with 
increasing lead time with some fluctuation around maximum ensemble dispersion. 

For a deeper understanding, we looked more closely at τ = 0.2 (i.e., a high 
ambiguity time) and τ = 4.8 (i.e., a low ambiguity time) in Figure 50 (a) and (b). For 
analysis of forecast probability in Figure 50(a), the event threshold resulting in 
E(p_T) = 50% is X = -1.72. A wide range of forecast probability values is possible 
using this threshold with each constituent individually. The calculated range of values 
spans from 1% to 98%, with total ambiguity from 7% to 92% (85%). The total ambiguity 
compares well with the average value shown in Figure 49 for p* = 50%, although the 
width is slightly larger than the average for this particular EoE forecast case and variable. 
Looking at the later lead time in Figure 50(b) and using a different event threshold 
(X = 2) that again gives E(p_T) = 50%, the range of constituent forecast probability 
values is much smaller, spanning 26% to 78% with total ambiguity of 34% (33% to 
67%). Again, this is consistent with the evolution shown in Figure 49, where ambiguity 
decreases for later lead times. 
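
The total ambiguity values quoted above are widths of the 5th to 95th percentile range of the constituents' forecast probabilities. A minimal sketch of that calculation is given below; the function name is hypothetical and the evenly spaced input is a stand-in for illustration, not EoE output.

```python
import numpy as np

def total_ambiguity(p_values):
    """Return the 5th percentile, the 95th percentile, and their difference."""
    p5, p95 = np.percentile(p_values, [5, 95])
    return float(p5), float(p95), float(p95 - p5)

# Stand-in probabilities spread evenly between 0 and 1.
p5, p95, width = total_ambiguity(np.linspace(0.0, 1.0, 101))
print(p5, p95, width)
```

Applied to the 100 constituent probabilities at a given lead time, the same three numbers correspond to the bounds and width reported for each panel of Figure 50.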

From this analysis, the primary influences on the size of the EoE ambiguity 
distributions appear to be how much variation is present between the locations of the 
constituents’ PDFs and the uncertainty (i.e., ensemble spread) of the constituents’ PDFs. 
Early in the forecast [Figure 50(a)], the typical spread of each constituent is still quite low 
with an average spread of 0.371. The standard deviation of the constituents’ means is 
equal to 0.224, which is comparable in size. For a centrally located decision threshold 
like the one chosen here, the constituent PDFs will be dispersed on either side of the 
threshold, but since the variation in PDF location is comparable to the average 
spread of the constituent forecasts, the constituent PDFs will cross the threshold to 
varying degrees giving a wide range of forecast probability values. 

Repeating this analysis with the constituents at the later lead time [Figure 50(b)], 
we see that the standard deviation of the constituents’ means has increased and is now 
equal to 0.737. As expected, the average constituent spread has also increased and now 
equals 3.81, which proportionally is a much greater change than was seen in the increase 
in variation of constituent locations (~900% increase versus ~200% increase, 
respectively). For a given event threshold, the percentile location of the threshold within 
each constituent PDF is now much more alike, leading to similar, albeit slightly different, 
forecast probability values from each constituent. Thus as the typical spread of the 
constituent PDFs increased without a proportional increase in the variation in PDF 
location, the ambiguity associated with the forecast decreased. 
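
The dependence described above can be illustrated with an idealized calculation. The 
sketch below rests on simplifying assumptions that are not part of the EoE procedure: 
every constituent PDF is normal with the same spread, the constituent means are normally 
distributed about a centrally located threshold, and total ambiguity is taken as the width 
of the central 90% interval of forecast probability. The function name `total_ambiguity` 
is illustrative only. 

```python
from statistics import NormalDist

def total_ambiguity(sd_of_means, spread, ci=0.90):
    """Width of the central `ci` interval of forecast probability.

    Assumes (for illustration) constituent PDFs N(mu_i, spread) whose means
    mu_i are N(threshold, sd_of_means), i.e. a centrally located threshold.
    """
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + ci / 2.0)   # about 1.645 for a 90% interval
    shift = z * sd_of_means / spread         # location offset in units of spread
    # Forecast probability is monotonic in the constituent mean, so the
    # percentiles of the mean map directly onto percentiles of probability.
    return std_normal.cdf(shift) - std_normal.cdf(-shift)

# Values quoted in the text for Figure 50(a) and Figure 50(b)
early = total_ambiguity(sd_of_means=0.224, spread=0.371)
late = total_ambiguity(sd_of_means=0.737, spread=3.81)
print(f"early lead time: {early:.0%}, later lead time: {late:.0%}")
```

Under these assumptions the interval narrows at the later lead time because the ratio of 
location variability to spread falls from roughly 0.6 to roughly 0.2. The idealization is 
not expected to reproduce the 85% and 34% widths quoted above, which come from the 
actual constituents. 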

Figure 51 illustrates the sensitivity of forecast probability to PDF spread and 
shifts in PDF location, where a low spread (thick solid) and high spread (dot-dash) PDF 
are shifted from a mean position of 0.75 to -0.25 while maintaining the same spread. In 
Figure 51(a), the probability of preceding the event threshold (thin solid) for the low and 
high spread PDFs is 15.9% and 35.4%, respectively. Following the shift in location in 
Figure 51(b), the low spread probability is 63.1%, which is a change of 47.2%. The high 
spread probability is 55%, giving a change of 19.6%. The location shift resulted in a 
larger displacement of probability density relative to the event threshold for the low 
spread PDF. Thus shifts (or errors) in location are likely to produce a wider ambiguity 
distribution when ensemble spread is low. 
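
The probabilities quoted for Figure 51 can be reproduced with a short calculation. The 
sketch below assumes normal PDFs, an event threshold at 0, and spreads (standard 
deviations) of 0.75 and 2.0 for the low and high spread PDFs. These three values are not 
stated in the text; they are inferred here because they reproduce the quoted numbers. 
"Preceding the event threshold" is read as falling below it. 

```python
from statistics import NormalDist

THRESHOLD = 0.0                          # assumed event threshold (inferred)
SPREADS = {"low": 0.75, "high": 2.0}     # assumed standard deviations (inferred)

def prob_below(mean, spread, threshold=THRESHOLD):
    """Probability that a N(mean, spread) variable falls below the threshold."""
    return NormalDist(mu=mean, sigma=spread).cdf(threshold)

def shift_effect(spread, mean_before=0.75, mean_after=-0.25):
    """Probability before and after the location shift, and the change."""
    before = prob_below(mean_before, spread)
    after = prob_below(mean_after, spread)
    return before, after, after - before

for label, spread in SPREADS.items():
    before, after, change = shift_effect(spread)
    print(f"{label} spread: {before:.1%} -> {after:.1%} (change {change:.1%})")
```

The same one-unit shift in location moves more than twice as much probability across the 
threshold for the low spread PDF as for the high spread PDF. 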

The same concept applies to event thresholds that are not centrally located. In 
this case though, the forecast probability values for many of the constituent PDFs will 
become more certain (i.e., closer to 0% or 100%) leading to a relatively tighter, more 
skewed ambiguity distribution. For example, an event threshold of X = -4 in Figure 
50(a) leads to virtually no ambiguity as all constituent PDFs fall above the threshold. 
For this research, we made comparisons of ambiguity distributions using event thresholds 
that gave the E (Pj.) of the constituents equal to specific p* values being tested (Figure 
37, page 108), thus event thresholds used always fell amongst the constituent PDFs. 

The relationship between the variability in constituent PDF location and the PDF 
variance is a major influence on ambiguity. To explain this, we found the variance of the 
constituents’ means and the average variance of the constituents for each of the 100 EoE 
forecast cases. These measures are combined over the datasets in order to compare the 
average variance between constituent means and the average constituent variance at each 
lead time, shown in Figure 52. When ambiguity is high, the average variance of the 
constituent means is comparable in magnitude to the average constituent variance. As 
forecast lead time increases, there is an increase in both metrics, but the rate of increase 
in each is not proportional. The average constituent variance increases at a much faster 
rate leading to a decrease in ambiguity with increasing lead time. 

Using Figure 53, we compare the variance information found in Figure 52, the 
ratio of these two variance values, and the average changes in the EoE total ambiguity for 
different p* values (same as Figure 49). The variance ratio is computed as the average 
variance in constituent location over the average constituent variance. At the beginning 
of the forecast period, the variance ratio is high, indicating that the variation in the 
location of the constituent PDFs is nearly as large as the typical constituent spread. 
During maximum dispersion, there is a rapid increase in the average constituent variance 
(0.137 to 7.28 for an over 5000% increase), while the variance of constituents’ means 
increases much less (0.0505 to 0.397 for an increase of less than 700%), resulting in a 
rapid drop in the variance ratio. This period is accompanied by a 40%-50% decrease in 
total ambiguity for the values shown. Following maximum dispersion, the ratio 
asymptotes to a minimum value (~0.0370) as the average constituent variance continues 
to gradually increase. At this time, total ambiguity asymptotes to a minimum value as 
well, but not to zero. The variance of the constituents’ means still results in ambiguity 
even though constituent variance saturates towards climatology, since shifts in the 
constituents’ PDF locations play a large role in changing the forecast probability value 
associated with each constituent. 
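
The arithmetic behind the variance ratio and the quoted percentage increases can be 
checked directly. The sketch below uses only the four variance values quoted in the text 
for the period of maximum dispersion; the function and variable names are illustrative. 

```python
def percent_increase(before, after):
    """Percentage increase from `before` to `after`."""
    return 100.0 * (after - before) / before

def variance_ratio(var_of_means, avg_constituent_var):
    """Variance of constituent means over average constituent variance."""
    return var_of_means / avg_constituent_var

# Values quoted in the text (start and end of maximum dispersion)
var_means_start, var_means_end = 0.0505, 0.397
avg_var_start, avg_var_end = 0.137, 7.28

ratio_start = variance_ratio(var_means_start, avg_var_start)
ratio_end = variance_ratio(var_means_end, avg_var_end)

print(f"average constituent variance: +{percent_increase(avg_var_start, avg_var_end):.0f}%")
print(f"variance of constituent means: +{percent_increase(var_means_start, var_means_end):.0f}%")
print(f"variance ratio: {ratio_start:.3f} -> {ratio_end:.3f}")
```

Because the denominator grows far faster than the numerator, the ratio drops by almost an 
order of magnitude over this period, before settling toward the minimum value noted above. 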

This analysis furthers our understanding of the relationship between forecast 
uncertainty and ambiguity, while solidifying the relationship between ambiguity and 
random errors in the ensemble PDF due to unaccounted-for sources of uncertainty. We 
determined that ambiguity is closely linked to both forecast uncertainty (i.e., first-order 
uncertainty) and the sensitivity of the EPS to deficient IC and model perturbations, in that 
the interaction of these two factors controls the magnitude of the total ambiguity 
associated with a given forecast situation for a particular EPS. 

Recall that EoE is an impractical approach to estimating ambiguity due to the 
large computational expense. Can ambiguity be estimated without the EoE using 
statistical characteristics of the EPS’s ensemble forecasts? Looking at Figure 52 and 
Figure 11 (page 87), the average variance of the EoE constituents is the same as the 
average variance of the L96M ensemble forecasts taken over the large forecast dataset. 
As a proxy for the variance of constituents’ means, the variance of the mean error in the 
ensemble mean (ME-) found using the L96M ensemble forecasts may be used. This 
relationship is shown in Figure 54 (similar to Figure 53). The time evolution of average 
ensemble forecast variance and variance of ME- follows the same behavior as seen in 
Figure 53. The variance ratio (taken as ME- variance over average ensemble variance) 
indicates a similar behavior as well, but note the greatly reduced ratio value early in the 
forecast when using the EPS error statistics. A comparison of the ratio values from 
Figure 53(b) and Figure 54(b) is shown in Figure 55. Since the average variance of the 
ensemble forecasts and the EoE constituents are the same, any difference between the 
ratio values must be due to the variance in ME-. In this case, the variance in ME- is not 
large enough to accurately simulate the variation in possible ensemble PDF locations 
found using the EoE (i.e., possible realizations of the ensemble PDF given limitations in 
the EPS perturbations). Thus, ambiguity estimates obtained using the EPS error 
characteristics may be greatly underestimated. After maximum dispersion, the ratio nears 
a value of one, indicating ambiguity estimates created using the EPS error characteristics 
may improve. This problem may be due to the sub-setting used to arrive at the variance 
in ME- (see Chapter III.B.3). Attempts to use the variance of errors in the ensemble 
mean without sub-setting (i.e., finding the variance of the individual ensemble mean error 
values without averaging) gave an extreme over-estimate of ambiguity at all lead times, 
since the variance of possible error in the ensemble mean value is larger than the average 
variance of the ensemble forecasts at all lead times. 

B. VALIDATION OF AMBIGUITY ESTIMATES 

The discussion of CESq and RCR ambiguity estimate validation in this section 
refers primarily to the series of comparisons shown in Figure 56 and Figure 57. Each 
panel in Figure 56 shows comparisons across all forecast lead times for a specific p* 
value. Alternately, each panel in Figure 57 provides comparisons across all tested p* 
values for a certain forecast lead time. In both figures, the set p* value or forecast lead 
time used to create each individual panel is displayed at the top of the panel. All 
comparisons show the difference in total ambiguity [Equation (26), page 61] of both the 
CESq and RCR ambiguity distributions compared to the EoE ambiguity distribution, 
where a negative difference indicates the CESq or RCR ambiguity distribution is too 
narrow compared to EoE. This validation strategy gave us a look at how well the 
practical ambiguity estimation techniques simulate the variance of our best estimate of 
the ambiguity distribution. 

From Figure 56 and Figure 57, we see that the ambiguity distributions from the 
practical estimation techniques appeared to perform very poorly early in the forecast, 
with total ambiguity differences near 30%, but each showed improvement with increased 
forecast lead time. Although this feature may appear to be tied to forecast lead time, it is 
actually tied to the ensemble variance, which plays a significant role in the production of 
ambiguity. 

Figure 58 shows that the CESq and RCR ambiguity distributions generally 
followed the same evolution as the EoE estimate, where all of the estimates had relatively 
large ambiguity at early times that decreased with time. The exception may be RCR, 
where the total ambiguity began to increase again following a period of decrease. From 
Figure 11 (page 87), as expected, we see that ensemble variance increased on average 
with increasing lead time. This can also be seen in Figure 59 (a) and (b), where the 
number of uncertain forecasts (i.e., forecasts with p* between 0.1% and 99.9%) 
increased with time, indicating that fewer forecasts existed where the event threshold fell 
outside of the forecast PDF. Even though the ratio in Figure 55 shows that the variance 
of the ME- error distribution was greatly underestimated early in the forecast (by a factor 
of six), the CESq and RCR distributions still exhibited maximum ambiguity early in the 
forecast (Figure 58), as a result of the generally low ensemble variance. 

This analysis suggests that ambiguity will evolve from high values to low values 
on average as a result of the typical increase in ensemble spread with time. Of course, it 
is possible on a case-by-case basis for an ensemble forecast to exhibit small spread at any 
lead time, resulting in large ambiguity associated with the forecast probability. Thus total 
ambiguity is not necessarily a function of forecast lead time, but rather depends strongly 
on the spread of the current ensemble forecast at the lead time in question. 

From the panels in Figure 57, we see that the largest differences in total ambiguity 
occurred with mid-range forecast probability values. In the first panel (r = 0.2), the 
difference in total ambiguity for p* = 50% was almost 30%, while the differences for 
both p* = 1% and p* = 99% were between 4% and 7%. Thus it may appear the CESq and 
RCR estimates performed better for extreme forecast probability values. The apparent 
disparity in performance is simply a result of the lower and upper bounds (i.e., 0% and 
100%, respectively) confining the range of possible forecast probability values. In 
general, we expect to see tighter ambiguity distributions for the extreme forecast 
probability values. Consider a single set of 100 EoE constituents. An event threshold 
that results in an expected value of 1% for the EoE ambiguity distribution (using 
probability of exceeding) will likely fall above (i.e., to the right of) many of the 
constituents’ PDFs, resulting in forecast probabilities very close to or equal to 0% for 
those constituents. In this case, the ambiguity distribution is tighter since it is bounded 
on the low end, where many near 0% forecast probabilities accumulate. An event 
threshold giving an expected value of 50% for the ambiguity distribution for the same 
EoE forecast case produces a wider ambiguity distribution since the placement of the 
threshold is centrally located among the constituent PDFs, allowing forecast probability 
values to spread evenly on either side of 50%. 
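
The bounding effect can be illustrated with an idealized set of constituents. The sketch 
below assumes normal constituent PDFs with a common spread and normally distributed 
constituent means centered on zero, and it reuses the early lead time values quoted 
earlier (0.224 and 0.371) purely for illustration. The function name 
`ambiguity_interval` and the 90% interval convention are assumptions of this sketch. 

```python
from math import hypot
from statistics import NormalDist

STD_NORMAL = NormalDist()

def ambiguity_interval(expected_prob, sd_of_means, spread, ci=0.90):
    """Central `ci` interval of exceedance probability across constituents.

    Assumes constituent PDFs N(mu_i, spread) with mu_i ~ N(0, sd_of_means).
    """
    # A member drawn from a random constituent is N(0, spread^2 + sd_of_means^2),
    # so this threshold makes the expected exceedance probability `expected_prob`.
    threshold = -hypot(spread, sd_of_means) * STD_NORMAL.inv_cdf(expected_prob)
    z = STD_NORMAL.inv_cdf(0.5 + ci / 2.0)
    lower = STD_NORMAL.cdf((-z * sd_of_means - threshold) / spread)
    upper = STD_NORMAL.cdf((z * sd_of_means - threshold) / spread)
    return lower, upper

for target in (0.01, 0.50):
    low, high = ambiguity_interval(target, sd_of_means=0.224, spread=0.371)
    print(f"expected {target:.0%}: interval {low:.2%} to {high:.2%}")
```

For the 1% target the lower bound sits essentially at 0% and the interval extends only 
toward higher probabilities, whereas the 50% target gives a wide interval that is 
symmetric about 50%. 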

Therefore, problems with the variance of CESq or RCR ambiguity distributions 
were seen on both sides of the distributions as shown in Figure 60. The figure shows 
EoE and CESq ambiguity distributions computed from a single EoE forecast case for the 
same variable, both centered on p* = 50% (in accordance with the validation method) at 
r = 5. The CESq distribution was too narrow (20% versus 32% total ambiguity), and 
the total ambiguity difference when compared to the EoE distribution was equivalent on 
either side at 6%. In contrast, Figure 61 shows the EoE and CESq ambiguity 
distributions for p* = 5% using the same EoE forecast case and variable at the same 
forecast time. Here, the differences between the distributions were chiefly present in the 
direction of higher forecast probability values. Both of the distributions are bounded by 
0% on the low side, which resulted in a similar value (approximately 2%) for the lower 
bound of the 90% CI for each estimate. Thus the difference in total ambiguity was 
essentially one-sided, where the upper bounds are 9% and 12% for CES and EoE, 
respectively. Although we may be encouraged by the results for extreme forecast 
probability values, it is important to understand that this improvement is in part an 
artificial result. 

We found the CESq total ambiguity to be too narrow in relation to the aggregated 
EoE ambiguity distributions regardless of forecast lead time or forecast probability value 
tested. A leading contributor to this problem was the creation of CESq ambiguity 
distributions using random draws from the distribution of average ensemble variance, 
thus ensemble variance was independent of the forecast situation. Therefore, the typical 
ensemble variance used to compute the Pj values was near the average, which in many 
cases would likely be too high compared to the flow-dependent variance. Since larger 
ensemble variance produces a narrower ambiguity distribution, the configuration of 
CESq will likely result in a consistent underestimation of the total ambiguity. It is likely 
that the flow-dependent CES^ estimates would alleviate much of this problem, but recall 
that this technique was unavailable when the validation study was performed. 

Additionally, from Figure 55, the variance of the ME- error distribution used to 
develop the CESq ambiguity distributions was not wide enough to adequately simulate 
the variance in forecast PDF location typically found using the EoE constituent forecasts. 
This deficiency was particularly severe at the early forecast lead times prior to maximum 
dispersion, where the variance of the ME- distribution was as much as six times lower. 
So, even if the ensemble variance was correctly simulated, the CESq sample forecast 
distributions would not be sufficiently separated to produce a wide enough ambiguity 
distribution. Thus early in the forecast, the combined problems of using 
forecast-independent ensemble variance and greatly underestimated ME- variance resulted 
in large differences in total ambiguity, where the deficiency in the ME- variance was 
likely the dominant factor. 

ME- variance improved to less than a factor of two difference later in the forecast 
following maximum dispersion, performing best towards the end of the period of 
maximum dispersion (ratio value was approximately 1.35). At this point, we found the 
best performance in CESq total ambiguity, but the total ambiguity was still too small, 
likely because the ME- variance was slightly too low and because we did not account for 
flow-dependent ensemble variance. Following maximum dispersion, the ratio value 
began to increase slightly, indicating that the ME- variance was performing worse, but 
the slow increase did not continue beyond five time units, and the ratio value never 
increased past 1.7. The slow deterioration of the ME- variance and the continued 
increase in average ensemble variance over this period resulted in narrowing of the 
CESq ambiguity distributions and a slow increase in total ambiguity difference, which 
tapered off near five time units. 

We attempted to use the ratio value from Figure 55 to improve CESq total 
ambiguity by increasing (i.e., correcting) the ME- variance at each lead time by its 
respective ratio value. Results showed improved total ambiguity at all lead times, but the 
correction factor caused overcorrection early in the forecast and was still too small later 
on (example shown in Figure 62 for p* = 50%). The CESq estimates were likely still 
degraded due to the lack of flow-dependence, thus this line of research was not pursued 
further. 

From Figure 56 and Figure 57, the RCR total ambiguity was too narrow during 
the early forecast lead times, but then transitioned to become slightly too wide later in the 
forecast for most of the p* values tested. Since the RCR distributions are 
flow-dependent, we find more evidence that the highly deficient variance of the ME- error 
distribution early in the forecast played a significant role in degrading the CESq and 
RCR ambiguity distributions. During this timeframe, the RCR PDFs could not 
adequately separate to generate sufficient ambiguity compared to the EoE because of the 
poor ME- variance. 

As the performance of the ME- error distribution began to recover, the total 
ambiguity difference for RCR, like CESq, improved as well. Unlike CESq, the RCR 
estimates showed continued improvement beyond maximum dispersion, eventually 
becoming too wide, but by no more than 3% compared to EoE. As forecast error growth 
increased, the average variance of each of the EoE constituents followed, thus decreasing 
the width of the EoE ambiguity distributions. The RCR ambiguity estimate used only the 
first constituent of a given EoE forecast case, where the variance of the constituent’s 
forecast PDF was varied for each resample based on random draws from the σ² error 
distribution. In general, the variance of any resampled PDF would be similar to the 
average constituent variance, but over 10,000 resamples, there were likely many 
fractional error draws resulting in a relatively narrower PDF (compared to the 
average EoE constituent variance), which inevitably produced a wider ambiguity 
distribution. Thus the total ambiguity difference between RCR and EoE switched from 
negative to positive values for later forecast lead times. 

In the previous discussion of the CESq and RCR ambiguity distributions, we 
have not yet made judgments about the validity of the estimates. In order to make a 
judgment, we first consider the use of EoE ambiguity distributions as the standard. EoE 
provides a flow-dependent ambiguity estimate that accounts for finite ensemble size and 
samples the sensitivity of the probability forecast to deficient analysis and model 
perturbations in the EPS. Analogous to the single ensemble forecast providing the best 
guess for uncertainty in the deterministic forecast, the EoE gives us our best-guess 
estimate of the uncertainty in the ensemble forecast. However, EoE suffers from the 
same basic limitations as an EPS. Limited sampling due to the finite number of 
constituents results in random error in the EoE ambiguity distribution. Also, any 
incomplete perturbations (simulating EPS deficiencies) in the EoE design will result in 
systematic underestimation of ambiguity. 

It is obvious that deficiencies exist on average in both CESq and RCR, especially 
early in the forecast when ambiguity is the highest (i.e., when ensemble variance is 
typically low). The total ambiguity estimates from CESq and RCR improve with time 
and draw fairly close to the EoE value (generally < 10% and < 5% difference for CESq 
and RCR, respectively) during the timeframe of highest error growth rate between 
r = 0.8 and r = 3.4. 

From basic chaos and ensemble forecasting theory, we understand that nonlinear 
error growth limits predictability, making the ensemble forecast the best source of forecast 
information in general. However, considering only the deterministic NWP forecast may 
still be appropriate early in the forecast period while average error is below about 10% of 
the climatological variance (σ_c²) (i.e., the deterministic realm), which on average occurs 
between r = 0.6 and r = 1.0 for the L96M EPS. In Figure 59, prior to maximum error 
growth, the frequency of uncertain forecasts is low due to the low ensemble variance 
found in the early forecast period. Therefore, at early forecast lead times when the CESq 
and RCR ambiguity estimates are performing at their worst, their deficiencies are not 
critical since forecast uncertainty is not prevalent (i.e., ambiguity is not or rarely needed). 

The rate of error growth is dependent on the scale of the forecasted phenomenon, 
where faster error growth is generally observed for smaller scale phenomena. This scale- 
dependency impacts the ambiguity distributions in the same fashion. The EPS error 
distributions and the ensemble spread statistics are also phenomena-dependent, meaning 
that the ambiguity distributions are also tied to the forecast error growth. So, regardless 
of the forecast variable, total ambiguity estimates from CESq and RCR will evolve from 
high to low values but on different variable- or scale-dependent time scales, producing 
reasonably accurate estimates of total ambiguity past the initial deterministic realm. 
Therefore, we conclude that CESq and RCR ambiguity distributions are likely good 
enough to provide valuable information to the decision process. 

This conclusion should be tempered to apply to situations where the expected 
values of the CESq or RCR ambiguity distributions are equal to or near the expected 
value of the EoE ambiguity distribution, per our experiment design. In general, the 
calibrated forecast probability is merely a random sample from the EoE ambiguity 
distribution, thus it may fall anywhere within the distribution. Since the calibrated 
forecast probability is also the expected value of the CESq and RCR ambiguity 
distributions, the estimated distributions are often not collocated with the EoE ambiguity 
distribution. This issue is discussed in detail in the next section. 

In the following sections, we discuss the results of value studies that incorporated 
the CES and/or RCR ambiguity information into the decision making process. For these 
studies, the question of ambiguity estimate validity becomes a question of whether or not 
the ambiguity information adds value. For example, even if we show that the difference 
in total ambiguity between RCR and EoE is large for some forecast situation, the RCR 
ambiguity information may still positively influence the decision making process over the 
long-term and add value, while on a case-by-case basis the results will vary due to 
deficiencies in the estimation process. 

C. VALUE USING UNCERTAINTY-FOLDING 

In this section, we assess improvements to value from the uncertainty-folding 
technique. Recall that we used two separate event thresholds designed to represent a 
common and a rare event. Uncertainty-folding was performed using ambiguity 
distributions from the EoE, CESq and RCR estimation techniques to provide values 
for each method. In addition, a grand ensemble was tested where all constituent members 
for a single EoE forecast case were combined to form a large ensemble giving a single 
forecast probability value. These four decision input sources were compared 
relative to the value provided by basing decisions on the control ensemble forecast 
probability alone. The control ensemble forecast was taken as the first constituent of 
each EoE forecast case. Significance of the results in this section was assessed using the 
95% CI for the results produced by resampling. 

To check if the L96M EPS control forecast was behaving well with respect to 
value, we first verified that its forecast probability was outperforming the deterministic 
forecast. If not, the deterministic forecast would be more appropriate to use in decision 
making, and ambiguity about the forecast probability is irrelevant. If the control forecast 
probability does add value compared to the deterministic forecast, then our 
uncertainty-folding results will show if any additional value can be added by incorporating 
the ambiguity information. For this comparison, we computed the integrated optimal VS 
[IOVS, Equation (27), page 72] for both common and rare event thresholds for the 
deterministic and control ensemble forecasts, displayed in Figure 63 (a) and (b), 
respectively. The deterministic forecast was taken as the first member of the first 
constituent in each EoE forecast case. In both figures, we see that the control ensemble 
forecast provided significantly better value than the deterministic forecast, except at very 
early forecast lead times when the deterministic skill was still high. 





For both events, the relative IOVS found using the EoE values and the grand 
ensemble’s values was generally greater than one throughout the forecast as shown in 
Figure 64 (a) and (b), indicating that these two sources provided additional value 
compared to the control forecast. We found the improvement to be significant past 
r = 1.4 for the common event. For the rare event, the improvement was only significant 
at sporadic lead times. 

At most lead times, the grand ensemble appeared to provide slightly better value 
than the EoE data. The scores for these two methods started close to one and then 
increased during the time of maximum dispersion. At the beginning of the forecast, the 
skill associated with the control ensemble forecast was still quite high, thus it was 
difficult for the grand ensemble or EoE to improve on the value attained by the control. 
As the forecast dispersion and error growth ramped up, the skill of the control ensemble 
decreased, and the grand ensemble and EoE were able to provide greater value due to the 
additional information available in each method. 

Each grand ensemble was a collection of 2,100 ensemble members where IC and 
model perturbations were varied within the range of uncertainty. Thus the grand 
ensemble accounted for deficiencies in the modeling system much more thoroughly than 
a single 21-member ensemble forecast. The EoE ambiguity distribution was able to 
provide additional value for the same reason, since it incorporated each constituent’s 
simulation of uncertainty in the EPS perturbations. The grand ensemble appeared to 
marginally outperform the EoE (although not significantly) since information may have 
been lost during the conversion of each EoE constituent to a single Pj value. Prior to 
computing Pj, each constituent ensemble forecast contained information regarding the 
current first-order uncertainty (i.e., spread), as well as higher-order moments of the 
forecast PDF. This information was lost when a single Pj value was used to estimate the 
event uncertainty, and then combined with the other 99 estimates. The grand ensemble 
on the other hand retained all information when making its single estimation of the event 
uncertainty. 

The scores found using uncertainty-folding with the CESq and RCR values 
were generally not significantly different from one throughout the forecast for both the 
common and rare event, indicating that they performed on par with the control forecast. 
In the validation section, we saw that both techniques did a reasonable job of estimating 
ambiguity. Their lack of value here can be explained by considering how their ambiguity 
distributions were produced. Both techniques’ ambiguity distributions were centered on 
the control ensemble forecast’s (i.e., the first constituent from an EoE forecast case) p* 
value. Based on the uncertainty-folding computation, both the CESq and RCR p^ 
values should remain close to p*. Thus the value attained using the practical ambiguity 
estimates is unlikely to be significantly different from that of the control ensemble 
forecast. 

Additionally, p* is a random sample from the EoE ambiguity distribution for a 
certain forecast case, thus it could fall anywhere within the EoE ambiguity distribution. 
We performed validation by artificially locating event thresholds where the expected 
value of the EoE ambiguity distribution was equal to p*, thus collocating the CESq and 
RCR ambiguity distributions with the EoE ambiguity distribution. Therefore, validation 
only provided a measure of how well the estimation techniques matched with respect to 
the variance of their respective ambiguity distributions. 

Figure 65 shows a situation where p* was collocated with the expected value of 
the EoE distribution using a single EoE forecast case at r = 4, where the RCR ambiguity 
distribution was shown to provide a reasonably good ambiguity estimate. The 100 EoE 
constituent Pj values were histogrammed using a class interval of 1%. For clarity in the 
figure, the 10,000 RCR Pj values were fit using a beta distribution. Although the 
beta-fit does not always provide a quality fit to the Pj data, it was sufficient for the 
pedagogical purpose here. For this forecast case, the total ambiguity of the EoE and RCR 
distributions appeared to match well as was expected (90% CI widths for EoE and RCR 
are 26% and 31%, respectively). Since the RCR distribution was collocated with the EoE 
distribution, both estimations gave similar values (78.2% and 78.8%) when used with 
uncertainty-folding. In Figure 66, we show a different case at the same forecast lead time 
where the RCR value occurred in the upper tail of the EoE ambiguity distribution. 
The total ambiguity of the two distributions was still relatively close (27% and 34%), but 
there is a large difference between the values (19.5% and 36.4%). 

From this analysis, we see that while the practical estimation techniques were 
fairly effective at simulating the variance of the ambiguity distribution, differences 
should typically exist between the EoE and the CESq and RCR p^ values since p* is a 
random sample within the EoE ambiguity distribution. These differences produce errors 
when using the estimates to compute a single decision input combining the first- and 
second-order uncertainty, reducing the value of the decision input in normative decision 
making. On the other hand, the theoretical and impractical EoE ambiguity estimate was 
able to add significant value to the decision making process, since its p^ value is not tied 
to the control forecast probability. Additionally, while each of the estimation methods 
produces consistent estimates of the ambiguity, EoE provides a sharper distribution 
eliminating bogus Pj possibilities, resulting in a better p^ value. 

D. VALUE USING SECONDARY CRITERIA 

This section describes our experiments using the ambiguity information to add 
value to the decision making process when considering the secondary criterion of repeat 
false alarms. Our goal was to use the ambiguity information to significantly reduce the 
number of repeat false alarms while maintaining the primary value (measured by optimal 
VS, as well as POD and POMD) associated with normative decision making within the 
C/L scenario. To alter the secondary criterion (i.e., reduce repeat false alarms), a user was 
allowed to reverse the current decision of taking protective action if and only if a false 
alarm had just occurred at the same location. We compared the primary value and 
secondary criterion results for various possible user decision rules (Table 6, page 120) to 
evaluate the effectiveness of each. The experiment was performed using real-world 
forecast data as described in Chapter III.F. We used the 95% CI found through 
resampling to assess the significance of results among users. 

Prior to exploring the secondary criteria, we first evaluated the performance of 
GEFS in relation to the deterministic forecast (member #1). As described previously, 
forecast error growth was still quite low and barely out of the deterministic realm at 120 
hours (Chapter III.F.4), so we needed to determine whether GEFS was adding value compared 
to the deterministic forecast at this time. From Figure 67, the deterministic forecast 
provided value for a large range of C/L (10% to 85%), but GEFS added significant value 
over the deterministic forecast and provided value over a greater range of C/L (1% to 
91%). Thus it made sense to use the ensemble forecasts since we were at a lead time 
where the ensemble was adding significant value over the deterministic forecast. We 
were primarily concerned with the value in secondary criteria that could be added to users 
with low C/L since they experience frequent false alarms. At low C/L, the opportunities 
for false alarms are numerous since there are many forecasts directing the user to protect 
(have low p] and result in a non-occurrence of the event). Alternately, high C/L users 
generally see fewer false alarms so may be less concerned with their repeats. 

The number of repeat false alarms found following GEFS with each C/L is shown 
in Figure 68. As expected, there were a large number of repeat false alarms for the 
extremely low C/L values, because there were many forecasts that required the user to 
protect. As the C/L increased, fewer false alarm opportunities were available. The 
fastest rate of decrease in the number of repeat false alarms occurred between C/L 1% 
and 5%. From Figure 44(b) (page 114), approximately 40% of all forecast probability 
values from the 50,220 forecasts fell within the 0%-5% bin. Accordingly, once the C/L 
increased beyond 5%, a large portion of the forecast opportunities would direct the user 
to not protect, greatly reducing the overall number of false alarm opportunities. The 
rate of decrease slowed as C/L increased, but the number of repeat false alarms never 
reached zero, even for the highest C/L of 99%. Since ensemble spread was still relatively 
low at the forecast lead time, many of the control probability forecasts were close to 0% 
and 100% (i.e., forecast res was high). Approximately 23% of the forecasts fell within 
the 95%-100% bin, as seen in Figure 44(b). Thus due to the large number of high 
probability forecasts, there were still false alarm and repeat false alarm opportunities, 
even for the highest C/L values. 

The VS results for the always user (Table 6, page 120) and the control user are 
compared in Figure 69. Obviously, the number of repeat false alarms for the always user 
was zero at all C/L (i.e., 100% reduction), and the change was significant. However, 
choosing to always avoid repeat false alarms severely degraded the primary VS, because 
many of the reversals resulted in additional misses (e.g., for C/L 1%, total misses were 
increased from 35 to 1767). Trading false alarms for misses can severely degrade the VS 
for low C/L users, due to the large change in expense (C << L). Thus it would take many 
correct reversals (i.e., false alarm to correct rejection) to account for one incorrect 
reversal (i.e., hit to miss) (Table 3, page 32). Making an incorrect reversal will have a 
much smaller effect on the VS for high C/L users, since C ≈ L and the total expense will 
not be increased greatly. Therefore, changes to the VS will typically be insignificant for 
high C/L users following the always decision rule, as seen in Figure 69. 
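
The asymmetry between correct and incorrect reversals can be illustrated with a short calculation. The sketch below assumes the standard cost-loss expense matrix (protecting costs C whether or not the event occurs, an unprotected event costs L, and a correct rejection costs nothing), which may differ in detail from Table 3. The function name is illustrative only.

```python
def correct_reversals_per_miss(cost_loss_ratio: float) -> float:
    """Correct reversals needed to offset one incorrect reversal.

    Assumed expense matrix (standard cost-loss model, normalized so L = 1):
      hit or false alarm -> C, miss -> L, correct rejection -> 0.
    A correct reversal (false alarm -> correct rejection) saves C.
    An incorrect reversal (hit -> miss) adds L - C.
    """
    if not 0.0 < cost_loss_ratio < 1.0:
        raise ValueError("C/L must lie strictly between 0 and 1")
    saving = cost_loss_ratio          # C, with L = 1
    penalty = 1.0 - cost_loss_ratio   # L - C
    return penalty / saving


for ratio in (0.01, 0.10, 0.50, 0.90):
    needed = correct_reversals_per_miss(ratio)
    print(f"C/L = {ratio:.0%}: {needed:.2f} correct reversals per new miss")
```

Under these assumptions, a user at C/L 1% needs 99 correct reversals to pay for a single new miss, whereas at C/L 90% one correct reversal offsets roughly nine new misses.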

The VS for the always user was significantly reduced compared to the control 
over the C/L range 1% to 70%. Beyond C/L 70%, the difference in VS was not 
statistically significant, but the change in our other primary value metrics (POD and 
POMD) was significant through C/L 90% (e.g., POD shown in Figure 70). For C/L 
greater than 90%, there was no significant difference between the control user and the 
always user, but at these C/L, false alarms are typically not a concern (as discussed 
above). We found that this user provided the most significant reduction in our secondary 
criterion, but also the greatest degradation in primary value. 

Results for the random decision rule are shown in Figure 71 and Figure 72. This 
uninformed user who based the decision to reverse his protective action on a coin toss 
was also able to significantly reduce repeat false alarms for all C/L. However, the 
primary value metrics indicated that the random user’s decision strategy was also 
significantly reducing the primary value. Specifically, the VS was significantly lower 
over the C/L range 1% to 59%, while the performance based on POD and POMD was 
significantly different through C/L 80% (e.g., Figure 73). 
The percent reduction in repeat false alarms at each C/L was approximately 60% 
(Figure 74), which was greater than our anticipated amount of 50% (i.e., over many cases 
the option to change should occur in approximately half of the opportunities). This was 
an indication that the decision rule was breaking up series of repeat false alarms. For 
example, consider a specific grid point that had three false alarms in a row, resulting in 
two repeat false alarm events counted at that point. If the random user reversed the 
decision for the second false alarm, then both of the repeat false alarm events would be 
eliminated. 
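
The three-in-a-row example can be checked with a minimal counting sketch. It assumes a repeat false alarm is any false alarm that immediately follows another false alarm in the time-ordered series at one grid point, and that a reversed false alarm becomes a correct rejection. The function and variable names are illustrative.

```python
def count_repeat_false_alarms(false_alarm: list[bool]) -> int:
    """Count false alarms that immediately follow another false alarm."""
    return sum(prev and curr for prev, curr in zip(false_alarm, false_alarm[1:]))


# Three consecutive false alarms at one grid point: two repeat events.
original = [True, True, True]
print(count_repeat_false_alarms(original))

# Reversing the decision for the second forecast turns that false alarm
# into a correct rejection, which breaks the series.
after_reversal = [True, False, True]
print(count_repeat_false_alarms(after_reversal))
```

A single reversal in the middle of a run removes two repeat events, which is consistent with a coin-toss rule achieving more than a 50% reduction.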

The brash user, who applied a fixed decrease (i.e., 5%) to the control forecast 
probability to mitigate repeat false alarms, was surprisingly able to achieve primary value 
scores similar to those associated with the control user (Figure 75) at all C/L. 
Furthermore, the brash user significantly reduced the number of repeat false alarms for 
two C/L ranges, 1% to 9% and 95% to 99% (Figure 76). The percent reduction from 
Figure 74 for the lower C/L range decreased from 32% to approximately 11%. The 
reduction then fluctuated between 5% and 10% for mid-range C/L before dramatically 
increasing once again for C/L above 90%. The larger reductions for the very low C/L 
values were mainly due to the large proportion of forecasts (~43%) found between 0% 
and 10%, which resulted in more chances to reverse the decision. For the second range 
of C/L (95% to 99%), the percent reduction was 100% (Figure 74). Since the brash user 
always decreased the forecast probability by 5% for repeat false alarm opportunities, 
there were no repeat false alarms for C/L > 95% (i.e., forecast probabilities greater than 
95% were always reduced to 95% or less), which mimicked the always decision rule. 
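
A minimal sketch of the brash rule follows. It assumes the normative rule is to protect when the forecast probability strictly exceeds C/L, and that the fixed 5% decrease is applied as percentage points only when a false alarm has just occurred at the same location. The function name and argument names are illustrative.

```python
def brash_decision(prob_pct: float, cost_loss_pct: float,
                   prior_false_alarm: bool, decrease_pct: float = 5.0) -> bool:
    """Return True if the brash user takes protective action.

    Assumption: protect when the (possibly reduced) forecast probability
    strictly exceeds the user's C/L, both expressed in percent.
    """
    if prior_false_alarm:
        # Repeat false alarm opportunity: apply the fixed decrease.
        prob_pct = max(prob_pct - decrease_pct, 0.0)
    return prob_pct > cost_loss_pct


# High C/L user: even a 100% forecast is reduced to 95% and never exceeds C/L.
print(brash_decision(100.0, 96.0, prior_false_alarm=True))
# Low C/L user: only forecasts within 5 points above C/L are reversed.
print(brash_decision(8.0, 5.0, prior_false_alarm=True))
print(brash_decision(12.0, 5.0, prior_false_alarm=True))
```

With these assumptions the rule reproduces the behavior described above: at the highest C/L every repeat opportunity is reversed, while at low C/L only marginal forecasts are affected.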

If we increased the brash user’s arbitrary percent decrease to forecast probability, 
we would see wider ranges of significantly higher percent reduction at both C/L extremes 
due to the same effects described above. However, the brash user’s primary value would 
be significantly reduced for the extreme low C/L if the arbitrary reduction is too large. In 
other words, increasing the brash user’s percent decrease takes him closer to behaving 
like the always user, who clearly failed to maintain primary value. 
We now turn to the decision rules where the estimated CES_L ambiguity 
distribution was employed to reduce the number of repeat false alarms. The decision 
to reverse the protective action for repeat false alarm opportunities used a variable 
overlap threshold (a function of C/L). If the overlap exceeded the threshold, the decision 
was reversed and no protective action was taken. The results for the conceptual model 
user (Figure 77 and Figure 78) reveal a significant decrease in the secondary criteria at all 
C/L, but an inability to maintain all the primary value. The VS was significantly lower 
than the control’s VS only for C/L 1% to 4%, but the POD and POMD indicated a 
significant difference through C/L 12% (Figure 79). 
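
The overlap test can be sketched as follows. The overlap is not defined in this section, so the sketch assumes it is the percentage of the ambiguity distribution lying below the user's C/L (the portion that would favor not protecting). The sample-based representation and all names are illustrative assumptions, not the procedure used in the experiment.

```python
def overlap_pct(ambiguity_samples: list[float], cost_loss_pct: float) -> float:
    """Percent of sampled forecast probabilities that fall below C/L.

    Assumed definition of "overlap"; probabilities and C/L are in percent.
    """
    below = sum(1 for p in ambiguity_samples if p < cost_loss_pct)
    return 100.0 * below / len(ambiguity_samples)


def reverse_protection(ambiguity_samples: list[float], cost_loss_pct: float,
                       overlap_threshold_pct: float) -> bool:
    """Reverse the decision to protect when the overlap exceeds the threshold."""
    return overlap_pct(ambiguity_samples, cost_loss_pct) > overlap_threshold_pct


# Hypothetical ambiguity samples for one repeat false alarm opportunity.
samples = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
print(overlap_pct(samples, cost_loss_pct=9.0))
print(reverse_protection(samples, cost_loss_pct=9.0, overlap_threshold_pct=31.5))
print(reverse_protection(samples, cost_loss_pct=9.0, overlap_threshold_pct=50.0))
```

A smaller threshold permits more reversals, so the choice of threshold at each C/L controls the trade between fewer repeat false alarms and additional misses.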

Recall that the optimal overlap threshold was designed to find overlap threshold 
values that reduced repeat false alarms while maintaining the primary value metrics 
(Figure 80 and Figure 81). The optimal user was able to match the control user for the 
VS, POD and POMD metrics for all C/L, while also realizing an impressive improvement 
in secondary criteria. 

The percent reduction in repeat false alarms (Figure 74) for the conceptual model 
and the optimal user indicated that both decision strategies improved as C/L increased 
(i.e., percent reduction increased). As C/L increased, the number of repeat false alarm 
opportunities decreased, thus any reversal comprised a larger proportion of the available 
opportunities. The conceptual model had significantly fewer repeat false alarms through 
C/L 12%, but since this decision rule degraded primary value over the same C/L, it was 
not superior over this range of users. The conceptual model employed relatively small 
overlap thresholds (Figure 41, page 112) compared to the optimal user (Figure 46, page 
116) for the low C/L values (e.g., for C/L 1%, 0.5% versus 31.5%, respectively). Given 
the large number of false alarm opportunities for the low C/L, the conceptual model 
resulted in many more cases where the decision to protect was reversed, which led to an 
increase in expensive misses. 

Beyond C/L 12%, there was no significant difference in the percent reduction of 
repeat false alarms between the conceptual model and optimal users. Although the 
difference was insignificant, there was a crossover point (C/L = 57%) beyond which the 
expected percent reduction for the optimal user exceeded the 
conceptual model’s expected value (i.e., fewer repeat false alarms for the optimal user on 
average). Since the conceptual model overlap threshold increased with increasing C/L, it 
allowed fewer decision reversals than the decreasing optimal overlap threshold. 

As described in Chapter IV.C, the CES_L ambiguity estimate used during this 
experiment unavoidably suffered from errors in the location of the ambiguity distribution 
as a result of being centered on the control forecast probability. While the CES_L 
distribution likely provided a robust estimate of the variance of the ambiguity 
distribution, the range of possible forecast probability values may be shifted (compared to 
the ambiguity distribution from EoE). The shift in the CES_L ambiguity distribution was 
random since the control forecast probability is a random sample from the EoE ambiguity 
distribution. Thus there are random errors in the amount of overlap in cases where the 
decision is unclear, resulting in sub-optimal application of the ambiguity information. 
However, even with this deficiency, CES_L clearly added value to the secondary criteria. 

The results attained during this study clearly show the value of employing an 
estimate of the ambiguity associated with the ensemble forecast. The decision rules 
explored above provided evidence that mere random or arbitrary reversals of the decision 
for repeat false alarm opportunities were inferior to reversals made by intelligently 
applying the ambiguity estimate (even if the estimate was flawed) only when the decision 
was unclear. Moreover, we were able to train our decision process based on past 
performance to optimally select an overlap threshold at each C/L to maintain primary 
value while significantly adding value to our secondary criteria. 
Figure 48. Evolution of L96M EPS error variance for (a) mean error of ensemble mean and 
(b) fractional error in ensemble spread. The error variances are shown following 
calibration to remove systematic error. 

Figure 49. Average total ambiguity of the EoE ambiguity distributions for test forecast 
probability values 5% (o), 50% (*) and 95% (x). 
Figure 50. Arrangement of EoE constituents at a (a) high and (b) low ambiguity timeframe. 
The PDFs for 100 constituents in a single EoE forecast case are displayed using a 
normal fit (solid lines) for (a) r = 0.2 and (b) r = 4.8 time units. An arbitrary 
event threshold (dashed line) is also shown for analysis of forecast probability 
values for each constituent. Note that in (b) a different event threshold is used, 
and abscissa and ordinate scaling has changed. 
Figure 51. Example of forecast probability sensitivity to PDF spread and shifts in PDF 
location for low spread (thick solid) and high spread (dot-dash) PDF. In (a), both 
PDFs are located at 0.75, and the probability of preceding the event threshold 
(thin solid) is 15.9% and 35.4% for the low and high spread PDFs, respectively. 
In (b), each PDF is shifted to -0.25 while holding spread constant, giving 
probability values of 63.1% and 55% for the low and high spread PDFs, 
respectively. Probability for the low spread PDF changed by 47.2%, while the 
change was 19.6% for the high spread PDF. 

Figure 52. Comparison of average variance between EoE constituent ensemble forecast 
mean values (a) and average variance of EoE constituent ensemble forecasts (■) 
with increasing lead time. The comparison was made using 100 EoE forecast 
cases each containing 100 constituent ensemble forecasts. 
Figure 53. Comparing the average evolution of EoE constituent relationships to the typical 
EoE ambiguity evolution using (a) same as Figure 52, (b) the ratio of average 
variance in location of EoE constituent ensemble forecasts’ means to average 
constituent variance and (c) same as Figure 49. 
Figure 54. Comparing the evolution of average L96M ensemble forecast statistics to the 
typical EoE ambiguity evolution using (a) the variance of mean error in the 
ensemble mean (a) and average ensemble forecast variance (■) computed from 
24,000 L96M forecast cases, (b) the ratio of the variance of the mean error in the 
ensemble mean to the average ensemble variance in location and (c) same as 
Figure 49. 
Figure 55. Ratio of average variance of EoE constituent ensemble forecast means to the 
variance of the mean error in the ensemble forecast mean. The average variance 
in constituent means is computed using 100 EoE forecast cases each with 100 
constituent forecasts. The mean error is computed using 24,000 L96M EPS 
forecast cases, where the variance in mean error is found by computing the mean 
error over 3,000 subsets of eight forecasts each and taking the variance. 
Figure 56. Validation of CES_G (o) and RCR (*) total ambiguity across all forecast lead times 
for the specific test values (shown in Figure 37, page 108), which are labeled 
at the top of each panel. 
Figure 57. Validation of CES_G (o) and RCR (*) total ambiguity at select calibrated forecast 
probability values ( p]) (shown in Figure 37, page 108) for forecast lead times 
0.2-5.0 at an increment of 0.2. Lead times (r ) are labeled at the top of each 
panel. 
Figure 58. Total ambiguity evolution for EoE (+), CES (o), and RCR (*) for ambiguity 
distributions with expected value of 50%. 

Figure 59. Frequency of uncertain ensemble forecasts (i.e., control ensemble forecasts with 
pI between 0.1% and 99.9%) for (a) the common event of X > 6.31 and (b) the 
rare event of X > 9.98. The ensemble forecast for each variable from the first 
constituent of each EoE forecast case was utilized as a control ensemble forecast 
for a total of 800 forecasts. 
Figure 60. Ambiguity distributions for EoE (solid) and CES_G (dashed) with expected value 
equal to 50% for a single EoE forecast case at r = 5 time units for a single 
variable. The distributions are approximated using a beta-fit to the estimated 
forecast probability values for each technique. The upper (UB) and lower (LB) 
bounds of each technique’s 90% CI (i.e., total ambiguity) are labeled. 

Figure 61. Ambiguity distributions for EoE (solid) and CES_G (dashed) with expected value 
equal to 5%. Same as Figure 60. 
Figure 62. Comparison of validation of CES_G without correction (o) and with correction (x) 
applied to the variance of the ME- distribution. The correction is based on the 
ratio of variance in EoE constituents’ location to variance in ME- (Figure 55). 

Figure 63. Integrated optimal value score [IOVS, Equation (27), page 72] for the calibrated 
control ensemble forecast (solid) and the deterministic forecast (dashed) for (a) 
the common event and (b) the rare event. 
Figure 64. Relative integrated optimal value score [IOVS, Equation (27), page 72] using 
uncertainty-folding with EoE (dashed), CES_G (dotted) and RCR (dot-dashed) for 
(a) the common event and (b) the rare event. The score for the grand ensemble 
(solid) is also shown in both panels. Error bars represent the 95% CI found using 
resampling. Note the ordinate scale change between (a) and (b). 
Figure 65. Control forecast probability well located with respect to the expected value of the 
EoE ambiguity distribution (80%). A histogram of Pj values for a single EoE 
forecast case (100 constituents) is shown with a Beta-fit curve for the RCR 
ambiguity distribution (solid line) created using the first constituent in the EoE 
forecast case as the control forecast. The control forecast probability (p* = 80%) 
is marked by the dashed line. 

Figure 66. Control forecast probability poorly located with respect to the expected value of 
EoE ambiguity distribution. Same as Figure 65 with the expected value of the 
EoE ambiguity distribution at 20% and p* = 40%. 
Figure 67. Optimal VS comparison for the GFS deterministic forecast (*) versus the GEFS 
forecast (o) using the application dataset of 50,220 forecast-observation pairs. 

Figure 68. Number of repeat false alarms for the control user at each C/L based on the 
application dataset of 50,220 forecast-observation pairs. 
Figure 70. POD comparison for the control user (solid) and the always user (dashed) based 
on the application dataset of 50,220 forecast-observation pairs. The difference 
between the users becomes insignificant beyond C/L 90% (inset). 

Figure 71. Optimal VS comparison for the control user (solid) versus the random user 
(dashed) based on the application dataset of 50,220 forecast-observation pairs. 

Figure 72. Repeat false alarm comparison for the control user (solid) versus the random user 
(dashed) based on the application dataset of 50,220 forecast-observation pairs. 

Figure 73. POD comparison for the control user (solid) and the random user (dashed) based 
on the application dataset of 50,220 forecast-observation pairs. The difference 
between the users becomes insignificant beyond C/L 80% (inset). 
Figure 74. Percent reduction in repeat false alarms from the control user using alternate 
decision rules in Table 6. Shown are the percent reduction for the optimal (solid), 
conceptual model (dashed), random (dot-dashed) and brash (dotted) users. The 
always user provided 100% reduction at all C/L and is not displayed. Results are 
based on the application dataset of 50,220 forecast-observation pairs. 
Figure 75. Optimal VS comparison for the control user (solid) versus the brash user (dashed) 
based on the application dataset of 50,220 forecast-observation pairs. 

Figure 76. Repeat false alarm comparison for the control user (solid) versus the brash user 
(dashed) based on the application dataset of 50,220 forecast-observation pairs. 
Figure 77. Optimal VS comparison for the control user (solid) versus the conceptual model 
user (dashed) based on the application dataset of 50,220 forecast-observation 
pairs. 

Figure 78. Repeat false alarm comparison for the control user (solid) versus the conceptual 
model user (dashed) based on the application dataset of 50,220 forecast-observation 
pairs. 
Figure 79. POD comparison for the control user (solid) and the conceptual model user 
(dashed) based on the application dataset of 50,220 forecast-observation pairs. 
The inset indicates that the difference between the users becomes insignificant 
beyond C/L 12%. 



Figure 80. Optimal VS comparison for the control user (solid) versus the optimal user 
(dashed) based on the application dataset of 50,220 forecast-observation pairs. 
Figure 81. Repeat false alarm comparison for the control user (solid) versus the optimal user 
(dashed) based on the application dataset of 50,220 forecast-observation pairs. 





V. CONCLUSIONS 


A. SUMMARY 

The primary tool for weather forecasters today is the NWP model, which provides 
a detailed forecast that unfortunately contains significant uncertainty (i.e., random error) 
due to analysis and model errors. An ensemble prediction system (EPS) generates a 
flow-dependent estimate of that uncertainty to provide information critical to optimal 
decision making. An ideal EPS will account for all sources of uncertainty associated 
with a particular deterministic modeling system. Today’s EPSs use a finite number of 
ensemble members and inadequate representation of the uncertainty associated with the 
initial conditions and model design. These deficiencies result in errors in the ensemble 
forecast PDF, thus measures of forecast uncertainty will be incorrect, including forecast 
probability specific to an event criterion. Thus, there is uncertainty in the estimation of 
forecast uncertainty, a phenomenon termed ambiguity, which can negatively impact the 
ability to optimize decisions. Ambiguity is the uncertainty surrounding the forecast 
probability, which can be described by a distribution of forecast probability values, 
referred to as an ambiguity distribution (NRC 2006; Eckel and Allen 2009). 

Ensemble forecasts can have high value in the decision making process. 
Numerous studies have shown the value of using probabilistic decision inputs over using 
deterministic or climatological information in the cost-loss (C/L) decision framework 
(e.g., Katz and Murphy 1997; Richardson 2000; Palmer 2002; Zhu et al. 2002). 
However, the possible additional value of using information about ambiguity has not 
been considered. In situations where the decision input is unclear (due to ambiguity), an 
objective estimate of the ambiguity may be valuable to the user. 

The three objectives of this research were to: (1) understand the mechanisms 
behind the evolution of ambiguity associated with an ensemble forecast, (2) validate 
objective estimates of ambiguity associated with an EPS, and (3) explore methods of 
applying the ambiguity information to add value in decision making. All three objectives 
were accomplished using an EPS based on a low-order, chaotic dynamical system, where 
our aim was to follow state-of-the-art practices in designing the low-order EPS so 
findings would reflect the performance of real-world, operational EPSs. Additionally, 
real-world EPS data was used for exploring value in objective #3. 

To explore the research objectives, we used the low-order, chaotic dynamical 
system first introduced by Lorenz (1996) as a suitable proxy for the atmosphere. The 
system (L96) describes the evolution of variables on two distinct scales (Lorenz 1996; 
Wilks 2005). The small-scale variables are unresolved and thus parameterized in a model 
of the system (L96M) using a stochastic parameterization, providing a model with 
random error. Data assimilation for the control analysis was accomplished using a 
perturbed-observation Ensemble Kalman Filter (EnKF) scheme. The L96M EPS used 
random draws from the EnKF members as its suite of initial conditions (IC). Model 
deficiencies were simulated in the EPS using the perturbed parameter approach applied 
through the stochastic parameterization, which randomly varied the parameter value for 
each member at every time step. For verification, ground truth was the solution from the 
complete L96 system. 

Ambiguity was estimated using three different techniques. The first technique, 
ensemble-of-ensemble (EoE), consisted of running multiple, parallel EPSs (constituents) 
for the same forecast case. The IC and model perturbations were varied within each 
constituent’s EPS, resulting in a spectrum of equally plausible ensemble forecast PDFs 
and a forecast probability PDF (i.e., ambiguity distribution) for any particular event at a 
given lead time. The EoE dynamically captures the EPS limitations (i.e., limited 
sampling and inadequate simulation of uncertainty), reflecting the EPS output’s 
sensitivity to the flow-dependent deficiencies in the perturbations associated with 
different regions in the model attractor. Since EoE is an impractical approach to 
estimating ambiguity, we also used two practical ambiguity estimation techniques, 
calibrated error sampling (CES) and randomly calibrated resampling (RCR). These 
techniques created ambiguity estimates using the long-term, average error characteristics 
of the first two moments in the ensemble PDF, mean error of the ensemble mean (ME-) 
and fractional error in ensemble spread (σ'), as proxies for the relationships among EoE 
constituent PDFs. 





The CES method took two forms, CES_G (global) and CES_L (local). CES_G used 
50,000 sets of random draws from the long-term, average distributions for ME-, σ' and 
ensemble spread to create a distribution composed from 50,000 possible values of true 
forecast probability for any value of calibrated forecast probability. CES_G produced a 
bulk (generic) ambiguity estimate, independent of ensemble spread, that could come from 
any event since the EPS characteristics are taken as the same across the entire attractor. 
CES_L provided a somewhat flow-dependent ambiguity estimate by following similar 
processing as CES_G but for specific values of ensemble spread. Thus the CES_L 
ambiguity estimate for a certain calibrated forecast probability value is different for 
different values of ensemble spread. (Note: CES_L was actually developed in response to 
the evolution and validation discoveries in this research so was omitted from validation 
but applied in the value studies.) 

RCR produces a somewhat flow-dependent ambiguity estimate using bootstrap 
resampling of the ensemble members. A distribution of 10,000 possible values of 
forecast probability is produced by generating 10,000 different versions of the members 
at each forecast point by resampling with replacement. This process accounts for limited 
sampling of the true forecast PDF due to the finite number of members in the EPS, and 
the ambiguity estimate is dependent on the number of members (i.e., fewer members give 
higher ambiguity). For RCR, each set of resampled members is calibrated using random 
coefficients drawn from the distributions for mean and fractional error of the ensemble 
PDF, which removes systematic error and brings in solutions missed by the original 
members due to EPS deficiencies. The RCR ambiguity distribution is generally wider 
than would be found using resampling alone. 
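
The resampling step alone can be sketched as below. This is a simplified illustration, not the RCR procedure itself: it omits the random calibration of each resampled set, assumes the forecast probability comes from a normal fit to the members, and takes total ambiguity as the width of the central 90% interval of the resulting probabilities. The member values and function names are hypothetical; the threshold 6.31 is the common-event threshold quoted in the Figure 59 caption.

```python
import random
from statistics import NormalDist, mean, quantiles, stdev


def event_probability(members: list[float], threshold: float) -> float:
    """P(X > threshold) in percent from a normal fit to the members (assumed)."""
    mu, sigma = mean(members), stdev(members)
    if sigma == 0.0:
        # Degenerate resample (all members identical): probability is 0 or 100.
        return 100.0 if mu > threshold else 0.0
    return 100.0 * (1.0 - NormalDist(mu, sigma).cdf(threshold))


def resampled_probabilities(members: list[float], threshold: float,
                            n_resamples: int = 10_000, seed: int = 0) -> list[float]:
    """Bootstrap the members with replacement and recompute the probability."""
    rng = random.Random(seed)
    k = len(members)
    return [event_probability(rng.choices(members, k=k), threshold)
            for _ in range(n_resamples)]


def total_ambiguity(prob_samples: list[float]) -> float:
    """Width of the central 90% interval (95th minus 5th percentile), in percent."""
    cuts = quantiles(prob_samples, n=20)  # 19 cut points at 5%, 10%, ..., 95%
    return cuts[-1] - cuts[0]


# Hypothetical eight-member ensemble for one variable at one lead time.
members = [5.1, 6.0, 6.4, 6.9, 7.3, 5.8, 6.6, 7.9]
probs = resampled_probabilities(members, threshold=6.31, n_resamples=2000)
print(f"total ambiguity from resampling alone: {total_ambiguity(probs):.1f}%")
```

Adding the random calibration described above would be expected to widen this distribution further.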

The evolution of ambiguity was explored using the EoE, as it produces our best 
estimate of ambiguity. Ambiguity was found to be highest early in the forecast period 
and then decrease quickly during peak forecast error growth. We found the primary 
influences on ambiguity magnitude to be the variability in location of the constituents’ 
PDFs and the uncertainty (i.e., ensemble spread) of the constituents’ PDFs. When the 
ratio of variance in constituents’ locations to ensemble variance is large (typically early 
in the forecast), large differences in forecast probability may be seen (i.e., high 
ambiguity) since changes in probability density relative to an event threshold are more 
sensitive to location changes when spread is low (Figure 51, page 147). Later in the 
forecast, the disproportionately larger increase in ensemble variance (due to error growth) 
compared to the separation between constituents’ PDFs results in a narrower range of 
forecast probabilities (i.e., low ambiguity), as probability density shifts amongst the 
constituents are similar. Since ensemble variance plays a significant role in the 
production of ambiguity, our results suggest that ambiguity generally evolves from high 
to low values as a result of the typical increase in ensemble spread with forecast lead 
time. Of course, it is possible on a case-by-case basis for an ensemble forecast to exhibit 
small spread at any lead time resulting in large ambiguity. However, the general 
conclusion is that ambiguity is a function of both ensemble spread and the sensitivity of 
forecast probability estimates to errors in PDF location. The irony of this finding is that 
sharper ensemble PDFs are generally considered to reflect better performance, but 
ambiguity can be greatly increased by the sensitivity to errors in location with sharper 
forecasts. 
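
The sensitivity illustrated in Figure 51 can be reproduced numerically with normal PDFs. The event threshold (0) and the standard deviations (0.75 and 2.0) used below are not stated in the text; they are values inferred here because they are consistent with the probabilities quoted in the Figure 51 caption, and should be read as assumptions.

```python
from statistics import NormalDist

THRESHOLD = 0.0      # assumed event threshold
LOW_SPREAD = 0.75    # assumed standard deviation of the low spread PDF
HIGH_SPREAD = 2.0    # assumed standard deviation of the high spread PDF


def prob_below(threshold: float, location: float, spread: float) -> float:
    """Probability (percent) that a normal PDF lies below the threshold."""
    return 100.0 * NormalDist(location, spread).cdf(threshold)


for label, spread in (("low spread", LOW_SPREAD), ("high spread", HIGH_SPREAD)):
    before = prob_below(THRESHOLD, 0.75, spread)    # PDF located at 0.75
    after = prob_below(THRESHOLD, -0.25, spread)    # PDF shifted to -0.25
    print(f"{label}: {before:.1f}% -> {after:.1f}% (change {after - before:.1f}%)")
```

The same shift in location changes the probability by about 47% for the sharper PDF but only about 20% for the broader one, which is the mechanism behind higher ambiguity when spread is small.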

Validation was performed using aggregated CES_G and RCR ambiguity 
distributions built over many locations on the L96M attractor to determine the overall 
effectiveness of the estimates in comparison to EoE. However, we could not validate the 
estimation methods’ ability to consistently capture the location of the EoE ambiguity 
distribution since a random error in location generally exists between EoE and the CES_G 
and RCR distributions. Validation showed how well CES_G and RCR captured the 
variance of the EoE ambiguity distribution. Comparisons made using the total ambiguity 
[Equation (26), page 61] of each method’s aggregated ambiguity distributions indicated 
the following trends: 

• The ambiguity distributions from the practical estimation techniques 
appeared to perform very poorly at early forecast lead times with total ambiguity 
differences near 30%, but each showed improvement with time; 

• The largest differences in total ambiguity appear to have occurred with 
mid-range forecast probability values; 

• The CESq total ambiguity was too narrow in relation to the aggregated 
EoE ambiguity distributions regardless of forecast lead time or forecast 
probability value tested; 

• The RCR total ambiguity was too narrow during the early forecast lead 
times, but then transitioned to become slightly too wide later in the forecast for 
most of the forecast probability values tested. 

The apparent disparity in performance of the CESq and RCR estimates at mid-range 
and extreme forecast probability values is simply a result of the lower and upper 
bounds (i.e., 0% and 100%, respectively) confining the range of possible forecast 
probability values. In general, we expect to see tighter ambiguity distributions for the 
extreme forecast probability values, thus total ambiguity is naturally smaller. 
Additionally, as the expected value of the ambiguity distribution approaches either 
extreme, the total ambiguity difference between the EoE and the CES or RCR estimates 
used for validation becomes more one-sided, reducing the difference. For example, when 
the expected value approaches 0%, the lower bounds of each estimation method’s 
ambiguity distributions become more similar, thus differences in total ambiguity are 
found primarily in the upper bounds. 

We found a leading factor in the under-spread CESq ambiguity distributions was 
the absence of flow-dependent ensemble spread. The typical ensemble variance used 
when estimating the forecast probability values was near the long-term average, which in 
many cases would likely be too high compared to the flow-dependent variance, 
producing a narrower ambiguity distribution. Additionally, the variance of the ME- error 
distribution used to create the CESq ambiguity distributions was inadequate, 
particularly at the early forecast lead times, thus the CESq sample forecast PDFs were 
not sufficiently separated to produce a wide enough ambiguity distribution. Therefore, 
CESq often underestimates the total ambiguity. 

Similar to CESq, the RCR ambiguity distributions were highly under-spread, 
likely due to the low variance of the ME- error distribution. Even though RCR considers 
the flow-dependent ensemble spread, the RCR PDFs could not adequately separate to 
generate the ambiguity levels provided by EoE. RCR total ambiguity recovered later in 
the forecast due to the improvement in variance of the ME- error distribution as well as 
its application of flow-dependent ensemble spread. The random calibration was likely 
the cause for slightly excessive ambiguity estimates later in the forecast. 

In general, ambiguity found using CESq and RCR evolved similarly to that of 
EoE (i.e., from high to low values), but the magnitude of the ambiguity early in the 
forecast was notably lower compared to EoE. We concluded that the variance in ME- 
(as used by CESq and RCR) underestimated the variation in possible ensemble forecast 
PDF locations found using the EoE (i.e., the constituents’ PDFs), thus limiting the 
variance of the ambiguity distributions, especially early in the forecast. However, we 
found the practical ambiguity distributions to be reasonably accurate estimates of the total 
ambiguity once error growth exceeded approximately 10% of the climatological variance. 
In cases where error growth is below 10%, ambiguity generally increases as the ensemble 
forecast PDF gets sharper, but for sharper PDFs, ambiguity is less often a factor since any 
given event is more certain (i.e., forecast probability closer to 0% or 100%). Therefore, 
we conclude that the CESq and RCR ambiguity distributions are likely good enough to 
provide valuable information to the decision process. 

This research introduced two approaches for attempting to add value to the 
decision making process using objective ambiguity estimates. The first approach, 
uncertainty-folding, combines the first- and second-order uncertainty information to once 
again give the user a single probabilistic decision input based on the weather information. 
We performed uncertainty-folding using ambiguity distributions from the EoE, CESq 
and RCR estimation techniques. We also tested a grand ensemble where all constituent 
members for a single EoE forecast case were combined to produce a single forecast 
probability value. These four decision input sources were compared in relation to the 
value provided by basing decisions on the control ensemble forecast probability alone. 
Results for two event thresholds (representing a common and a rare event) were 
examined. 
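
The two decision inputs contrasted above can be sketched as follows. This is only an illustration under stated assumptions: folding is taken here to mean averaging the constituents' forecast probabilities (the expected value of the ambiguity distribution), and the grand ensemble probability is taken from one normal fit to all pooled members. Neither formula is given in this chapter, and the function names and member values are hypothetical.

```python
import math
import statistics

def normal_exceedance(members, threshold):
    """Forecast probability P(X > threshold) from a normal fit to the members."""
    mean = statistics.fmean(members)
    std = statistics.stdev(members)
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 0.5 * (1.0 - math.erf(z))

def folded_probability(constituents, threshold):
    """Assumed folding: mean of the constituents' forecast probabilities."""
    probs = [normal_exceedance(c, threshold) for c in constituents]
    return statistics.fmean(probs)

def grand_ensemble_probability(constituents, threshold):
    """Pool every member of every constituent, then fit one normal PDF."""
    pooled = [x for c in constituents for x in c]
    return normal_exceedance(pooled, threshold)

# Hypothetical constituent ensembles (three members each, for brevity).
constituents = [[-1.0, 0.0, 1.0], [0.0, 1.0, 2.0], [-2.0, -1.0, 0.0]]
folded = folded_probability(constituents, threshold=0.0)
grand = grand_ensemble_probability(constituents, threshold=0.0)
print(f"folded: {folded:.3f}, grand ensemble: {grand:.3f}")
```

For this symmetric toy case both inputs equal 0.5; in general they differ because pooling changes the fitted variance while averaging does not.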

For both events, the integrated optimal value score (VS) found using the EoE and 
the grand ensemble showed improvement over the control ensemble forecast, but results 
were only significant for the common event at lead times beyond maximum error growth. 
The grand ensemble and EoE generally performed the same, indicating that resources 
may be better spent reducing ambiguity by running a larger EPS than estimating 
ambiguity with an impractical approach like EoE. The scores found using 
uncertainty-folding with the CESq and RCR were generally not significantly different from the 
control ensemble. Since the CESq and RCR ambiguity distributions are centered on the 
control forecast probability, the probability value computed using uncertainty-folding 
will not vary greatly from the control value, which prevented significant improvement in 
value. Additionally, random error in the location of the practical techniques’ ambiguity 
distributions produced errors when combining the first- and second-order uncertainty, 
likely reducing the value of the decision input in normative decision making. Thus 
uncertainty-folding may not be a useful approach to garner value from ambiguity since it 
only works well for EoE, the impractical method of ambiguity estimation. 

For the second method used to attain value using the ambiguity information, we 
looked at improving secondary criteria important to the decision-maker beyond the 
primary value (tied to minimizing total expense). The example secondary criterion 
considered was repeat false alarms, so the objective was to use the ambiguity information 
to significantly reduce the number of repeat false alarms while maintaining the primary 
value (measured by optimal VS, as well as probability of detection and probability of 
missed detection) associated with normative decision making within the C/L scenario. 
Several user decision rules were studied using real-world ensemble forecast data from the 
National Centers for Environmental Prediction’s Global Ensemble Forecast System, where 
the different users were allowed to reverse the current decision of taking protective action 
if and only if a false alarm had just occurred at the same location and their decision 
criterion was met. 

Of all the decision rules studied, the two that considered ambiguity (via an 
overlap threshold) when reversing decisions outperformed the others at significantly 
reducing repeat false alarms while maintaining the primary value. The overlap is the 
proportion of the ambiguity distribution indicating a different decision than the normative 
input; thus the overlap threshold represents the value of overlap at which the user 
reverses decisions. We saw the best overall performance by the user who followed the 
optimal overlap threshold. Developed from a training dataset, the optimal overlap 
threshold for each C/L was the lowest threshold giving the greatest reduction in repeat 
false alarms that resulted in no significant reduction in primary value. Although the 
conceptual model had significantly fewer repeat false alarms than the optimal user for 
low C/L, the optimal user fared better with regard to primary value than the conceptual 
model since it prevented excessive reversals, thus avoiding a large increase in misses 
(i.e., expense). For mid-range and high C/L, there was no significant difference between 
the optimal and conceptual model users. 
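
A decision rule of the kind described above can be sketched as follows. The sketch assumes the ambiguity distribution is available as a list of sample forecast probabilities and that the normative rule is to protect when the forecast probability meets or exceeds C/L; the function names, the inclusive comparisons, and the sample values are illustrative assumptions rather than the rules tested in this research.

```python
def overlap(ambiguity_probs, cost_loss_ratio, normative_prob):
    """Fraction of the ambiguity distribution favoring the opposite decision."""
    protect = normative_prob >= cost_loss_ratio
    disagree = [(p >= cost_loss_ratio) != protect for p in ambiguity_probs]
    return sum(disagree) / len(ambiguity_probs)

def decide(ambiguity_probs, normative_prob, cost_loss_ratio,
           overlap_threshold, false_alarm_just_occurred):
    """Return True to take protective action, False otherwise."""
    protect = normative_prob >= cost_loss_ratio
    # Reversal is allowed only for a protect decision right after a false alarm.
    if protect and false_alarm_just_occurred:
        frac = overlap(ambiguity_probs, cost_loss_ratio, normative_prob)
        if frac >= overlap_threshold:
            return False  # reverse the normative decision
    return protect

# Hypothetical ambiguity distribution sampled as five forecast probabilities.
ambiguity = [0.20, 0.25, 0.35, 0.40, 0.45]
print(overlap(ambiguity, 0.30, 0.35))           # 0.4
print(decide(ambiguity, 0.35, 0.30, 0.3, True))  # False (decision reversed)
```

Raising the overlap threshold makes reversals rarer, which is the trade-off the optimal overlap threshold tunes at each C/L.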

The results clearly show that we can attain tremendous improvements to 
secondary criteria by employing an objective ambiguity estimate in decision making. 
Moreover, we were able to train our decision process based on past performance to 
optimally select an overlap threshold at each C/L. Using the flow-dependent CES^ 
estimates for this study (instead of CESq, which inherently underestimated ambiguity) 
likely played a large role in attaining significant value for the secondary criteria. 

B. FUTURE RESEARCH 

The results presented in this research suggest several areas of future research, the 
first of which is to perform a validation study using the CES^ estimation method. The 
refinements made to include flow-dependence are likely to improve ambiguity estimation 
for CES^ compared to CESq, especially at later forecast lead times when the low ME- 
variance played less of a role in degrading the estimates. However, the inclusion of 
flow-dependent ensemble spread at early times may allow CES^ to produce a wider range of 
forecast probabilities, thus improving its estimates compared to EoE. Like RCR, CES^ 
validation will require aggregates of ambiguity estimates specific to EoE forecast cases. 

The next subject for research is continued investigation of the method used to 
determine the variance of the error distributions, i.e., sub-setting of the long-term 
verification dataset. The variance of the error distributions is obviously dependent on the 
size of the subset, and properly determining subset size is non-trivial. For this research, 
sub-setting was based on complete EPS runs to capture flow-dependent error 
characteristics, which appears to be inadequate. 

A related area of future research involves the implications of the spread-skill 
relationship in CES^. CES^ ambiguity estimates found using the domain averaged ME- 
variance in this research ignored the spread-skill relationship, but obtained reasonable 
and ultimately valuable estimates. However, for a well-calibrated EPS, the correlation 
between ensemble spread and ensemble mean error variance is nearly perfect (as seen in 
a binned spread-skill plot), so CES^ should perhaps use that information in estimating 
ambiguity. In that case, ambiguity would be similar regardless of the ensemble spread 
value since the variability in location error is proportional. Additionally, ambiguity 
would be much larger and overestimated in most cases given such a large ratio in the 
variances. Research is thus needed to resolve this contradiction. 
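
For readers unfamiliar with the binned spread-skill comparison mentioned above, a minimal sketch follows. It assumes equal-count bins ordered by ensemble spread and uses synthetic, statistically consistent cases in which the ensemble mean error is drawn from a normal distribution whose standard deviation equals the spread; the binning choice, sample size, and names are illustrative assumptions and do not reproduce this study's procedure.

```python
import math
import random

def binned_spread_skill(spreads, mean_errors, n_bins):
    """Return (mean spread, RMS ensemble-mean error) for equal-count spread bins."""
    pairs = sorted(zip(spreads, mean_errors))
    size = len(pairs) // n_bins  # any leftover cases at the top are dropped
    result = []
    for b in range(n_bins):
        chunk = pairs[b * size:(b + 1) * size]
        mean_spread = sum(s for s, _ in chunk) / len(chunk)
        rmse = math.sqrt(sum(e * e for _, e in chunk) / len(chunk))
        result.append((mean_spread, rmse))
    return result

# Synthetic cases: error ~ N(0, spread^2), i.e., spread matches error on average.
rng = random.Random(0)
spreads = [rng.uniform(0.5, 3.0) for _ in range(5000)]
errors = [rng.gauss(0.0, s) for s in spreads]
bins = binned_spread_skill(spreads, errors, n_bins=5)
for mean_spread, rmse in bins:
    print(f"spread {mean_spread:.2f}  rmse {rmse:.2f}")
```

In this idealized setting the binned points fall close to the one-to-one line, which is the near-perfect relationship referred to above.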

Further research should be conducted using the correction to ME- variance 
provided by the ratio in Figure 55 (page 150) (i.e., comparison of variance in constituent 
location to ME- variance) with the CES^ method. Greater improvements are expected 
than those seen with CESq in Figure 62 (page 166), due to the flow-dependence of 
CES^. If corrections to the total ambiguity are nearly perfect at all lead times, further 
investigations may be performed using a different low-order model to determine if a 
general relationship (i.e., correction) exists between ME- variance and constituent 
location variance that may be used for higher order models. 

Several subjects for future research involve utilization of the ambiguity 
information in decision making. While using the practical methods with 
uncertainty-folding may not have provided significant improvements in value using the optimal 
integrated VS (a combination of all users), it may be beneficial to perform a more 
thorough evaluation. Uncertainty-folding should be used with CES^ and RCR ambiguity 
estimates for events at specific lead times, thus allowing analysis of the results for 
specific users (i.e., C/L) instead of integrating all users into the optimal integrated VS. 
Future uncertainty-folding studies should also include real-world EPS data. 

We investigated just one of many possible secondary criteria, but future research 
in this area of value is nearly unlimited. Studying different secondary criteria entails 
developing methods to measure primary and secondary value, as well as determining 
methods for optimization of the decision process. The value of some secondary criteria 
may be hard to assess. For example, mission effectiveness (primary value) may be 
evaluated through battle damage assessment, but the intangible benefits such as improved 
morale (secondary value) that come with a successful mission are hard to quantify. In 
this case, the user may choose an alternate strike location with a greater chance of success 
to hopefully improve morale. Additionally, we looked at a single secondary criterion in 
isolation, but it may be equally important to the customer to consider multiple criteria 
(e.g., repeat false alarms and repeat misses). 

APPENDIX: FIGURE SEQUENCE DISPLAYING THE TIME EVOLUTION OF AMBIGUITY 


This appendix includes figures referenced in Chapter IV.A showing the evolution 
of ambiguity with increasing forecast lead time for a single EoE forecast case of 100 
constituents using an arbitrary X variable. The ambiguity distributions were 
determined for each forecast lead time using an X-value event threshold that resulted in 
an expected forecast probability of 50%, thus the event threshold was different for each 
forecast lead time. The histograms of constituent forecast probability values were created 
using a class interval of 1% over the range 0%-100%. Constituent PDFs were generated 
using a normal fit to the n ensemble members in each constituent ensemble forecast. Note 
that the abscissa range is fixed for all figures in both (a) and (b), while the ordinate range 
may vary based on the data. 
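
The construction described above can be sketched as follows. The sketch fits a normal PDF to each constituent's members, bins the resulting forecast probabilities into 1% class intervals, and reports the width between two empirical quantiles as a stand-in for total ambiguity. Equation (26) is not reproduced here, so the quantile bounds (5% and 95%), the synthetic members, and all names are assumptions made for illustration.

```python
import math
import random
import statistics

def constituent_probability(members, threshold):
    """Percent probability P(X > threshold) from a normal fit to the members."""
    mean = statistics.fmean(members)
    std = statistics.stdev(members)
    z = (threshold - mean) / (std * math.sqrt(2.0))
    return 100.0 * 0.5 * (1.0 - math.erf(z))

def histogram_1pct(probs_pct):
    """Counts in 100 class intervals of width 1% over 0%-100%."""
    counts = [0] * 100
    for p in probs_pct:
        counts[min(int(p), 99)] += 1  # a value of exactly 100% goes in the last bin
    return counts

def interval_width(probs_pct, lower_q=0.05, upper_q=0.95):
    """Assumed total-ambiguity proxy: width between two empirical quantiles."""
    ordered = sorted(probs_pct)
    lo = ordered[int(lower_q * (len(ordered) - 1))]
    hi = ordered[int(upper_q * (len(ordered) - 1))]
    return hi - lo, lo, hi

# Synthetic case: 100 constituents of 20 members with random location errors.
rng = random.Random(1)
constituents = []
for _ in range(100):
    offset = rng.gauss(0.0, 0.5)
    constituents.append([rng.gauss(offset, 1.0) for _ in range(20)])
probs = [constituent_probability(c, threshold=0.0) for c in constituents]
counts = histogram_1pct(probs)
width, lo, hi = interval_width(probs)
print(f"width {width:.0f}% ({lo:.0f}% to {hi:.0f}%)")
```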


[Figure 82 panel, forecast lead time 0.2: (a) histogram of constituent forecast probability values (abscissa: Forecast Probability (%); ordinate: Frequency) and (b) constituent PDFs (abscissa: X).] 

Figure 82. EoE ambiguity evolution showing (a) the histogram of constituent forecast 
probability values and (b) the constituent PDFs used to find each forecast 
probability for forecast lead times 0.2 to 5 at 0.2 increment (labeled at the top of 
each panel). The total ambiguity for this panel equals 85% (7% to 92%). 

[Figure 82 panel, forecast lead time 3.2.] 
(Figure 82 continued.) The total ambiguity for this panel equals 32% (34% to 66%). 

[Figure 82 panel, forecast lead time 3.4.] 
(Figure 82 continued.) The total ambiguity for this panel equals 30% (35% to 65%). 

[Figure 82 panel, forecast lead time 3.6.] 
(Figure 82 continued.) The total ambiguity for this panel equals 28% (36% to 64%). 

[Figure 82 panel, forecast lead time 3.8.] 
(Figure 82 continued.) The total ambiguity for this panel equals 32% (34% to 66%). 



LIST OF REFERENCES 


Anderson, J. L., 1996: A method for produeing and evaluating probabilistie foreeasts 
from ensemble model integrations. J. Climate, 9 , 1518-1530. 

Anderson, J. L., 1997: The impaet of dynamieal eonstraints on the seleetion of initial 

eonditions for ensemble predietions: Low-order perfeet model results. Mon. Wea. 
Rev., 125 , 2969-2983. 

Anderson, J. L., 2003: A loeal least squares framework for ensemble filtering. Mon. Wea. 
Rev., 131 , 634-642. 

Anderson, J. L., S. L. Anderson, 1999: A Monte Carlo implementation of the nonlinear 
filtering problem to produee ensemble assimilations and foreeasts. Mon. Wea. 

Rev., 127 , 2741-2758. 

Bowler, N. E., 2006: Comparison of error breeding, singular veetors, random 

perturbations and ensemble Kalman filter perturbation strategies on a simple 
model. TellusA, 58 , 538-548. 

Brooks, H. E., C. A. Doswell, 1993: New teehnology and numerieal weather predietion— 
a wasted opportunity? Weather, 48 , 173-177. 

Buizza, R., 1997: Potential foreeast skill of ensemble predietion and spread and skill 

distributions of the ECMWE ensemble predietion system. Mon. Wea. Rev., 125 , 
99-119. 

Buizza, R., M. Miller, and T. N. Palmer, 1999: Stoehastie representation of model 

uneertainties in the ECMWE Ensemble Predietion System. Quart. J. Roy. Meteor. 
Soc., 125 , 2887-2907. 

Burgers, G., P. Jan van Eeeuwen, and G. Evensen, 1998: Analysis seheme in the 
ensemble Kalman filter. Mon. Wea. Rev., 126 , 1719-1724. 

Camerer, C., M. Weber, 1992: Reeent developments in modeling preferenees: 

Uneertainty and ambiguity. Journal of Risk and Uncertainty, 5 , 325-370. 

Cohn, S. E., 1997: An introduetion to estimation theory. J. Meteor. Soc. Japan, 75, 257- 
288. 

Deseamps, E., O. Talagrand, 2007: On some aspeets of the definition of initial eonditions 
for ensemble predietion. Mon. Wea. Rev., 135 , 3260-3272. 


203 



Ebert, E. E., 2001; Ability of a poor man's ensemble to predict the probability and 
distribution of precipitation. Mon. Wea. Rev., 129 , 2461-2480. 

Eckel, E.A., 2003; Effective Mesoscale, Short-Range Ensemble Eorecasting. Ph.D. 

Dissertation, University of Washington Department of Atmospheric Sciences, 
Seattle, WA., 224 pp. 

Eckel, E. A., C. Mass, 2005; Aspects of effective mesoscale, short-range ensemble. Wea. 
Forecasting, 20 , 328-350. 

Eckel, E.A., M.S. Allen, Draft 2009; Estimating Ambiguity in Ensemble Eorecasts. 
Submitted to Weather and Eorecasting. 

ECMWE, cited 2009; Singular Vectors; Einear Perturbation Growth [Available online 
http;//www.ecmwf.int/research/r)redictabilitv/proiects/IC pert/SV method/index.htmll 

(Accessed July 30, 2009). 

Ellsberg, D., 1961; Risk, ambiguity, and the Savage axioms. Quart. J. Econ., 75 , 643- 
669. 

Evans, R. E., M. S. J. Harrison, R. J. Graham, and K. R. Mylne, 2000; Joint medium- 
range ensembles from The Met. Office and ECMWE systems. Mon. Wea. Rev., 
128,3104-3127. 

Evensen, G., 1994; Sequential data assimilation with a nonlinear quasi-geostrophic model 
using Monte Carlo methods to forecast error statistics. J. Geophys. Res., 99 , 
10143-10162. 

Evensen, G., 1997; Advanced data assimilation for strongly nonlinear dynamics. Mon. 
Wea. Rev., 125 , 1342-1354. 

Hamill, T. M., 2001; Interpretation of ra nk histograms for verifying ensemble forecasts. 
Mon. Wea. Rev., 129 , 550-560. 

Hamill, T. M., 2006; Ensemble-based atmospheric data assimilation. Predictability of 
weather and climate, T. Palmer and R. Hagedom, Eds., Cambridge University 
Press, 124-156. 

Hamill, T. M., S. J. Colucci, 1997; Verification of Eta-RSM short-range ensemble 
forecasts. Mon. Wea. Rev., 125 , 1312-1327. 

Hamill, T. M., C. Snyder, and R. E. Morss, 2000; A comparison of probabilistic forecasts 
from bred, singular-vector, and perturbed observation ensembles. Mon. Wea. Rev., 
128 , 1835-1851. 


204 



Hamill, T. M., J. S. Whitaker, and C. Snyder, 2001: Distanee-dependent filtering of 

baekground error eovarianee estimates in an ensemble Kalman filter. Mon. Wea. 
Rev., 129 , 2776-2790. 

Hansen, J.A., 2009: Personal eorrespondenee. 

Hansen, J. A., C. Penland, 2006: Effieient approximate teehniques for integrating 
stoehastie differential equations. Mon. Wea. Rev., 134 , 3006-3014. 

Houtekamer, P. L., J. Derome, 1995: Methods for ensemble predietion. Mon. Wea. Rev., 
123,2181-2196. 

Houtekamer, P. L., H. L. Mitehell, 1998: Data assimilation using an ensemble Kalman 
fdter teehnique. Mon. Wea. Rev., 126 , 796-811. 

Jolliffe, 1. T., D. B. Stephenson, 2003: Forecast verification: a practitioner's guide in 
atmospheric science. John Wiley & Sons, Ltd., 240 pp. 

Kalman, R. E., 1960: A new approaeh to linear filtering and predietion problems. J. Basic 
Eng., 82 , 35-45. 

Kalnay, E., 2003: Atmospheric modeling, data assimilation and predictability. 

Cambridge University Press, 341 pp. 

Katz, R. W., A. H. Murphy, 1997: Economic value of weather and climate forecasts. 
Cambridge University Press, 222 pp. 

Leith, C. E., 1974: Theoretieal skill of Monte Carlo foreeasts. Mon. Wea. Rev., 102 , 409- 
418. 

Leutbeeher, M., T. N. Palmer, 2008: Ensemble foreeasting. Journal of Computational 
Physics, 227, 3515-3539. 

Lewis, J. M., 2005: Roots of ensemble foreeasting. Mon. Wea. Rev., 133 , 1865-1885. 

Lorene, A. C., 2003: The potential of the ensemble Kalman filter for NWP-a eomparison 
with 4D-Var. Quart. J. Roy. Meteor. Soc., 129 , 3183-3203. 

Lorenz, E. N., 1963: Deterministie nonperiodie flow. J. Atmos. Set, 20 , 130-141. 

Lorenz, E. N., 1969: The predietability of a flow whieh possesses many scales of motion. 
Tellus, 21 , 289-307. 

Lorenz, E. N., 1993: The essence of chaos. University of Washington Press, 227 pp. 


205 



Lorenz, E. N., 1996: Predictability-a problem partly solved. Proc. Proc. Seminar on 
Predictability, 1-18. 

Magnusson, L., E. Kallen, and J. Nyeander, 2008: Initial state perturbations in ensemble 
foreeasting. Nonlin. Processes Geophy., 15 , 751-759. 

Mullen, S. E., R. Buizza, 2002: The impaet of horizontal resolution and ensemble size on 
probabilistie foreeasts of preeipitation by the ECMWF ensemble predietion 
system. Wea. Forecasting, 17 , 173-191. 

Murphy, A. H., 1985: Deeision making and the value of foreeasts in a generalized model 
of the eost-loss ratio situation. Mon. Wea. Rev., 113 , 362-369. 

Mylne, K. R., R. E. Evans, and R. T. Clark, 2002: Multi-model multi-analysis ensembles 
in quasi-operational medium-range foreeasting. Quart. J. Roy. Meteor. Soc., 128 , 
361-384. 

National Researeh Couneil Committee on Estimating and Communieating Uneertainty in 
Weather and Climate Foreeasts, 2006: Completing the Forecast: Characterizing 
and Communicating Uncertainty for Better Decisions Using Weather and Climate 
Forecasts. National Aeademies Press, 112 pp. 

Nutter, P., D. Stensrud, and M. Xue, 2004a: Effeets of eoarsely resolved and temporally 
interpolated lateral boundary eonditions on the dispersion of limited-area 
ensemble foreeasts. Mon. Wea. Rev., 132 , 2358-2377. 

Nutter, P., M. Xue, and D. Stensrud, 2004b: Applieation of lateral boundary eondition 
perturbations to help restore dispersion in limited-area ensemble foreeasts. Mon. 
Wea. Rev., 132 , 2378-2390. 

Orrell, D., 2003: Model error and predietability over different timeseales in the Eorenz'96 
systems. J. Atmos. Set, 60 , 2219-2228. 

Palmer, T. N., 2002: The eeonomie value of ensemble foreeasts as a tool for risk 

assessment: From days to deeades. Quart. J. Roy. Meteor. Soc., 128 , 747-774. 

Reichle, R. H., J. P. Walker, R. D. Koster, and P. R. Houser, 2002: Extended versus 
ensemble Kalman filtering for land data assimilation. J. Hydrometeor, 3 , 728- 
740. 

Riehardson, D. S., 2000: Skill and relative eeonomie value of the ECMWF ensemble 
predietion system. Quart. J. Roy. Meteor. Soc., 126 , 649-668. 

Riehardson, D. S., 2001: Ensembles using multiple models and analyses. Quart. J. Roy. 
Meteor. Soc., 127 , 1847-1864. 


206 



Shutts, G., 2005: A kinetic energy backscatter algorithm for use in ensemble prediction 
systems. Quart. J. Roy. Meteor. Soc., 131 , 3079-3101. 

Sivillo, J. K., J. E. Ahlquist, and Z. Toth, 1997: An ensemble forecasting primer. Wea. 
Forecasting, 12 , 809-818. 

Szczes, J.R., 2008: Communicating Optimized Decision Input from Stochastic 

Turbulence Forecasts. M.S. Thesis, Graduate School of Engineering and Applied 
Sciences, Naval Postgraduate School. 159 pp. [Available from the Defense 
Technical Information Center]. 

Szunyogh, I., Z. Toth, 2002: The effect of increased horizontal resolution on the NCEP 
global ensemble mean forecasts. Mon. Wea. Rev., 130 , 1125-1143. 

Tallagrad, O., R. Vautard, and B. Strauss, 1997: Evaluation of probabilistic prediction 
systems. Proc. from the Workshop on Predictability, Anonymous European 
Center for Medium-Range Weather Eorecasts, 1-25. 

TIGGE, cited 2009: Thorpex Interactive Grand Global Ensemble [Available online: 
http://tigge.ecmwfint/l (Accessed July 30, 2009). 

Tippett, M. K., J. L. Anderson, C. H. Bishop, T. M. Hamill, and J. S. Whitaker, 2003: 
Ensemble square root filters. Mon. Wea. Rev., 131 , 1485-1490. 

Toth, Z., E. Kalnay, 1993: Ensemble forecasting at NMC: The generation of 
perturbations. Bull. Amer. Meteor. Soc., 74 , 2317-2330. 

Toth, Z., Y. Zhu, I. Szunyogh, M. Iredell, and R. Wobus, 2002: Does increased model 
resolution enhance predictability? Preprints. Proc. Symp. on Observations, Data 
Assimilation, and Probabilistic Prediction, Orlando, PE. 

Tracton, M. S., E. Kalnay, 1993: Operational ensemble prediction at the National 
Meteorological Center: practical aspects. Wea. Forecasting, 8 , 379-398. 

Tribbia, J. J., D. P. Baumhefner, 2004: Scale interactions and atmospheric predictability: 
An updated perspective. Mon. Wea. Rev., 132 , 703-713. 

Wallsten, T. S., 1990: Measuring vague uncertainties and understanding their use in 

decision making. G.F. Furstenberg, Ed., Kluwer Academic Publishers, 377-399. 

Wang, X., C. H. Bishop, 2003: A comparison of breeding and ensemble transform 
Kalman filter ensemble forecast schemes. J. Atmos. ScL, 60 , 1140-1158. 


207 



Wei, M., Z. Toth, R. Wobus, Y. Zhu, C. Bishop, and X. Wang, 2006; Ensemble 

Transform Kalman Filter-based ensemble perturbations in an operational global 
prediction system at NCEP. Tellus, 58 , 28-44. 

Weisstein, Eric W. "Runge-Kutta Method." Vxom MathWorld-A Wolfram Web 

Resource. http://mathworld.wolfram.com/Runge-KuttaMethod.html (Accessed 
July 30, 2009). 

Whitaker, J. S., T. M. Hamill, 2002; Ensemble data assimilation without perturbed 
observations. Mon. Wea. Rev., 130 , 1913-1924. 

Wilks, D. S., 2005; Effects of stochastic parametrizations in the Lorenz'96 system. Quart. 
J. Roy. Meteor. Soc., 131 , 389-407. 

Wilks, D. S., 2006; Statistical methods in the atmospheric sciences. 2nd ed. Academic 
Press, 467 pp. 

WMO, cited 2009; The Observing System Research and Predictability Experiment 
[Available online; 

http://www.wmo.int/pages/prog/arep/wwrp/new/thorpex new.htmll (Accessed 
July 30, 2009). 

Zhu, Y., Z. Toth, R. Wobus, D. Richardson, and K. Mylne, 2002; The economic value of 
ensemble-based weather forecasts. Bull. Amer. Meteor. Soc., 83 , 73-83. 

Ziehmann, C., 2000; Comparison of a single-model EPS with a multi-model ensemble 
consisting of a few operational models. Tellus A, 52 , 280-299. 


208 




INITIAL DISTRIBUTION LIST 


1. Defense Technical Information Center 
Ft. Belvoir, Virginia 

2. Dudley Knox Library 
Naval Postgraduate School 
Monterey, California 

3. Air Force Weather Technical Library 
14th Weather Squadron 
Asheville, North Carolina 

4. Major Tony Eckel 
Naval Postgraduate School 
Monterey, California 

5. Dr. Wendell Nuss 
Naval Postgraduate School 
Monterey, California 

6. Dr. Patrick Harr 
Naval Postgraduate School 
Monterey, California 

7. Dr. Eva Regnier 
Naval Postgraduate School 
Monterey, California 

8. Dr. James Hansen 
Naval Research Lab 
Monterey, California 

9. Dr. Philip Durkee 
Naval Postgraduate School 
Monterey, California 



